Preface



With computers becoming an integral part of virtually all activities in our 
daily lives, how can we have effective and efficient, or simply natural and en- 
joyable, interactions with computers? One of the most promising technologies 
is life-like characters - embodied agents apparently living on the screens of 
computational devices - that invite us to communicate with them in familiar 
ways and even establish socio-emotional relationships. 

While embodied agents are becoming increasingly important as virtual
tutors, sales personas, guides, advisors, teammates, and personal representatives,
the mechanisms and design issues underlying successful character-based
applications are still not well understood, and specification and representation
languages for believable character behavior are hard to access and use.

This book is a collection of both academic and corporate endeavors de- 
scribing powerful and mostly freely available scripting languages and tools for 
controlling life-like behavior and systems that demonstrate the potential of 
life-like characters in a wide range of application fields. Affective functions, 
the key ingredient of life-likeness, are a common topic of all chapters. The
contributions have been carefully chosen and peer reviewed among authors. 
The chapter in Part I of this book introduces and motivates the topic, Part 
II contains chapters describing languages and tools for character control, and 
Part III is dedicated to character systems and applications. In Part IV, two 
leading experts in the field address the core themes of this book in the form of 
a synopsis. An Internet link to electronic material accompanying the chapters 
can be found in the Appendix. 

We would like to thank the Japan Society for the Promotion of Science 
(JSPS) for their generous support under the Research Grant (1999-2003) for 
the Future Program. Our special thanks go to Dr. Paolo Petta of the Aus- 
trian Research Institute for Artificial Intelligence who provided great encour- 
agement to publish a book on life-like characters. Thanks also to M.Eng. Ju- 
nichiro Mori for his patient assistance in the preparation of the manuscript. 







We are especially grateful to Ms. Ingeborg Mayer and Mr. Alfred Hofmann
of Springer-Verlag for offering invaluable help and providing optimal
conditions to publish this book.

It is our hope that this book will serve as a valuable guide to the successful 
design of life-like character applications for researchers and practitioners alike. 
We also hope that the scripting languages discussed in the book will be a 
milestone in the development of a standardized language for life-like character 
control. Finally, we hope that our book will ignite the interest of the broader 
community concerned with making interactions with computers more natural 
by using life-like characters as the technology of choice. 

Tokyo, June 2003
Helmut Prendinger
Mitsuru Ishizuka
Introduction




Introducing the Cast for Social Computing: 
Life-Like Characters 



Helmut Prendinger and Mitsuru Ishizuka 

Department of Information and Communication Engineering 
Graduate School of Information Science and Technology 
University of Tokyo 

7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan 
prendinger@acm.org, ishizuka@miv.t.u-tokyo.ac.jp

Summary. Life-like characters are one of the most exciting technologies for human- 
computer interface applications. They convincingly take the roles of virtual presen- 
ters, synthetic actors and sales personas, teammates and tutors. A common char- 
acteristic underlying their believability or life-likeness as conversational partners is 
computational models that provide them with affective functions such as synthetic 
emotions and personalities, and implement human interactive behavior or presentation
skills. In social computing, a paradigm that aims to support the tendency of
humans to interact with computers as social actors, life-like characters are key. They 
may embody the interface between humans and computers, and thus improve the 
otherwise poor communicative capabilities of computational devices. 

The success of life-like character applications today relies on the careful
crafting of their designers, mostly programmers. The wide dissemination of life-like 
character technology in interactive systems, however, will greatly depend on the 
availability of tools that facilitate scripting of intelligent life-like behavior. The core 
tasks include the synchronization of synthetic speech and gestures, the expression 
of emotion and personality by means of body movement and facial display, the 
coordination of the embodied conversational behavior of multiple characters possibly 
including the user, and the design of artificial minds for synthetic characters. 

In this chapter we will first describe what life-like characters are, and how they 
differ from related synthetic entities. We will then explain how life-like character 
technologies may change and improve the interaction between humans and com- 
puters. Next, we report on some of the most promising character scripting and 
representation languages as well as authoring tools currently available. After that, 
the most successful life-like character systems are briefly introduced, demonstrat- 
ing the wide range of applications where embodied agents are at work. Some final 
remarks on this highly active research field conclude this introductory chapter. 
1 Introduction

Life-like characters are synthetic agents apparently 
living on the screens of computers. An early charac- 
terization of life-like characters can be found in the 
work of Joseph Bates who refers to them as emo- 
tional [5] and believable [6] agents. Bates explains the 
notion of a “believable character” as “[...] one that 
provides the illusion of life, and thus permits the au- 
dience’s suspension of disbelief” [6, p. 122]. Follow- 
ing the vision of designing creatures that computer 
users are willing to perceive as believable or life-like, 
researchers use a variety of different terms: anthropo- 
morphic agents, avatars, creatures, synthetic actors, 
non-player characters, and embodied conversational 
agents [59, 22, 28]. While creation of most terms
is inspired by specific character applications, such as
avatars for distributed virtual environments like chat
systems or non-player characters for interactive games, some terms intend to
draw attention to a particular aspect of life-like characters. Embodied conversational
agents, for instance, are characters that visually incorporate, or
embody, knowledge about the conversational process [12].

Fig. 1. Animated agent

To restrict the focus of our discussion, we will draw a line between life- 
like characters that are graphically represented, or animated (see Fig. 1), and 
robotic agents that are realized as physical entities to operate in the physical 
world [9]. The concept of “life-likeness” is certainly not restricted to animated
agents. Dautenhahn [17], for instance, extensively discusses life-likeness in the 
context of robotic agents. A more subtle distinction concerns the restriction 
to animated rather than animate characters. According to Hayes-Roth and
Doyle [28], animate characters share all the features of life-like characters 
except for their embodiment; that is, animate characters are not necessarily 
animated, but can still be perceived as perfectly life-like. 

Although life-likeness is often associated with a “life-like” appearance, an- 
imate characters highlight the importance of synthetic minds that give char- 
acters individual personality and emotions. Bates [6] draws on the experience 
of professional character animators [58] when he argues that the portrayal 
of emotions plays a key role in the aim to create believable characters. On 
a par with emotions, personality is key to achieving life-likeness. Trappl and 
Petta [59] dedicated an entire volume to illustrate the personality concept in 
synthetic character research. Emotion and personality are often seen as the 
affective bases of believability [42], and sometimes the broader term social (or 
“socially intelligent”) is used to characterize life-likeness [22]. The presum- 
ably most profound account of what it means for a character to be (or rather 
seem) “life-like” is given by Hayes-Roth [27], who suggests seven qualities of 
life-likeness. Characters should seem conversational, intelligent, individual, social,
empathic, variable, and coherent, which are distilled from Hayes-Roth’s
experience with character-based systems in both academia and industry over
one decade. Hayes-Roth also suggests a variation of the famous Turing
test to evaluate the life-likeness of interactive characters. 

Characters can be life-like in a “human-like” or an “animal-like” way. 
While the design of human-like characters attracted the majority of re- 
searchers, there are also investigations on animal-like characters, especially 
dogs [8]. An ongoing debate concerns the issue whether the “life-likeness” 
of characters is more effectively achieved by realistic or cartoon-style agents.
Research that aims to create virtual humans typically follows the realistic 
approach [13, 24]; Thalmann et al. [57] even strive for photorealism. On the
other hand, most characters developed in the context of entertainment and 
“infotainment” systems adhere to the approach that uses cartoon-style char- 
acters [2, 45, 23, 61]. While the design of realistic characters is a high research 
aim per se, they do not necessarily outperform cartoon-style characters in the 
perception of the user. In contrast to cartoon-style characters, realistic-looking
characters raise high user expectations of their performance, which bears
the risk that even small behavior deficiencies lead to user irritation and dissatisfaction.
The question of realistic vs. cartoon-style agents can eventually only
be decided empirically with respect to specific application scenarios. McBreen 
et al. [39], for instance, investigate the effectiveness and user acceptability of 
different types of synthetic agents. A related empirical question concerns the 
benefits of displaying characters as facial agents (“talking heads”), full-body, 
or “upper-body plus face” agents. 



2 Towards Social Computing 

Since human-human communication is a highly effective and efficient way of 
interaction, life-like characters are promising candidates to improve human- 
computer interaction (HCI). Embodied agents may use multiple modalities 
such as voice, gestures, and facial expression to convey information and regu- 
late communication. The work of Reeves and Nass [49] shows that humans are 
(already) strongly biased in interpreting synthetic entities as social actors even 
if they do not display anthropomorphic features - the Media Equation. The 
authors carried out a series of classical psychological tests of human-human 
social interaction, but replaced one interlocutor by a computer with a human- 
sounding voice and a particular role such as a companion or opponent. The 
results of those experiments suggest that humans treat computers in an essen- 
tially natural way - as social actors - with a tendency, for instance, to be nicer 
in “face-to-face” interactions than in third-party conversations. More support 
for this result is provided by Lester et al. [33] who investigated the impact of 
animated agents in educational software along the dimensions of motivation 
and helpfulness, and coined the term “persona effect”, “[...] which is that 
the presence of a lifelike character in an interactive learning environment - 
even one that is not expressive - can have a strong positive effect on students’
perception of their learning experience” [33, p. 359]. 

There are hence strong arguments to make the interface social by adding 
life-like characters that have the means to send social cues to the user and 
possibly even receive such signals. However, it should not be concluded that 
all interfaces can be improved by making them social. As an approximation, it 
can be said that character-based interfaces are beneficial whenever the inter- 
action task involves social activity. Training, presentation, and sales certainly 
fall under this category. By contrast, there are computer-related activities 
that typically do not require social interaction. Building a spreadsheet, for 
instance, is a mechanical task and most users would not want to have a col- 
league watching or interrupting them [21]. The same may hold true for the 
presence of a synthetic agent. On the other hand, social encounters also in- 
clude information exchange between people that share similar interests, where 
life-like characters may act as match-makers at public meeting places. In order
to support “community awareness”, Sumi and Mase [54] investigated this
form of computer-mediated communication.

As an HCI paradigm, the goal of character-based human-computer interfaces
seems to be diametrically opposed to that of the “disappearing computer”
concept in ubiquitous and invisible computing [60]. Those technologies
are intended to “weave themselves into the fabric of everyday life until they 
are indistinguishable from it” [60, p. 94]. By contrast, the power of character- 
based HCI derives from the fact that people know how to interact with other 
people by using the modalities of their body (voice, gesture, gaze, etc.) and 
interpret the bodily signals of their interlocutors. Hence, character-based in- 
terfaces aim at realizing embodied interaction and intelligence [12] rather than 
interaction with “invisible” devices (see also the Gestalt user interface concept 
of Marriott and Beard [35]).

The vision of social computing is to achieve natural and effective interac- 
tion between humans and computational devices. As argued above, we believe 
that by employing life-like characters, social computing can be realized most 
efficiently. Social computing can be characterized as 

• computing that intentionally displays social and affective cues to users and
aims to trigger social reactions in users; and

• computing that recognizes affective user states and gives affective feedback
to users.

In this paradigm, life-like characters are seen as social actors, and hence as 
genuine interactive partners for a wide variety of applications, ranging from 
advisors and sales personas to virtual playfellows. A recent study in the social
computing paradigm concerns the “relational agents” described by Bickmore [7],
where characters are in the role of assistants for health behavior change (exer- 
cise adoption). He characterizes relational agents as computational (typically 
anthropomorphic) artifacts “[...] intended to produce relational cues or otherwise
produce a relational response in their users, such as increased liking
for or trust in the agent” [7, p. 27].

Besides displaying social cues, the second key premise for social computing 
is that life-like characters recognize social cues of their interlocutor, such as 
the affective state of the user. In this respect, social computing shares the 
motivation and goal of affective computing [44]. In the context of a tele-home
care application, Lisetti et al. [34] take physiological signals of the user so 
that a life-like character may respond appropriately. Conati [16] suggests an 
animated agent that adapts its behavior according to assessed user emotions 
in the setting of an educational game. Prendinger et al. [47] conducted an 
experiment that utilizes biosignals of users to demonstrate the calming effect 
of empathic character behavior.

The related notion of “Social Intelligence Design” [41], on the other hand, 
emphasizes the role of the web infrastructure as a means of computer-mediated 
interaction, community building and evolution, and collective intelligence, 
rather than (social) human-agent interaction. A full-fledged theory of social 
intelligence (or computing) will have to combine both aspects: (i) macro-level 
social interactions in a community of human and virtual agents, and (ii) micro- 
level social interactions between human users and virtual agents as personal 
representatives of other community members. 



3 Authoring Life-Like Characters 

One of the most challenging tasks in life-like character research is the design 
of powerful and flexible authoring tools for content experts. Unlike animators, 
who are skillful in creating believable synthetic characters, non-professionals 
will need appropriate scripting tools to build character-based applications 
[50]. Animating the visual appearance of life-like characters and integrating 
them into an application environment involves a large number of complex and 
highly inter-related tasks, such as: 

• The synchronization of synthetic speech, gaze, and gestures. 

• The expression of personality and affective state by means of body move- 
ment, facial display, and speech. 

• The coordination of the bodily behavior of multiple characters, includ- 
ing the synchronization of the characters’ conversational behavior (for in- 
stance, turn-taking). 

• The communication between one or more characters and the user. 

Observe that the mentioned tasks already assume that characters can be con- 
trolled at a rather “high” level, where designers may abstract from low-level 
concerns such as changing each individual degree of freedom in the charac- 
ter’s motion model. The Character Markup Language (CML) contains both 
low-level and medium-level tags to define the gesture behavior of a character 
as well as high-level tags that define combinations of other tagging structures 
[3]. Furthermore, CML allows one to define high-level attributes to modulate
a character’s behavior according to its emotional state and personality. The 
Virtual Human Markup Language (VHML) provides high-level and low-level 
tagging structures for facial and bodily animation, gesture, speech, emotion, 
as well as dialogue management [36]. The Scripting Technology for Embod- 
ied Persona (STEP) language contains high-level control specifications for 
scripting communicative gestures of 3D animated agents [29]. Being based on
dynamic logic [25], the STEP language includes constructs known from pro- 
gramming languages, such as sequential and non-deterministic execution of 
behaviors or actions, (non-deterministic) iteration of behaviors, and behav- 
iors that are executed if certain conditions are met. 
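
The control constructs listed above can be illustrated with a small Python sketch. It is not STEP syntax and not the STEP implementation; the combinator names (`seq`, `choice`, `repeat`, `if_then`) and the gesture names are hypothetical and chosen only to mirror sequential execution, non-deterministic choice, non-deterministic iteration, and conditional behaviors.

```python
import random
from typing import Callable, List

# A behavior is a function that appends primitive action names to a trace.
Behavior = Callable[[List[str]], None]


def act(name: str) -> Behavior:
    """Primitive action, e.g. a single gesture (names are illustrative)."""
    def run(trace: List[str]) -> None:
        trace.append(name)
    return run


def seq(*behaviors: Behavior) -> Behavior:
    """Sequential execution: run the behaviors one after the other."""
    def run(trace: List[str]) -> None:
        for behavior in behaviors:
            behavior(trace)
    return run


def choice(*behaviors: Behavior) -> Behavior:
    """Non-deterministic execution: run exactly one behavior, picked at random."""
    def run(trace: List[str]) -> None:
        random.choice(behaviors)(trace)
    return run


def repeat(behavior: Behavior, max_times: int = 3) -> Behavior:
    """Non-deterministic iteration: run the behavior 0..max_times times."""
    def run(trace: List[str]) -> None:
        for _ in range(random.randint(0, max_times)):
            behavior(trace)
    return run


def if_then(condition: Callable[[], bool], behavior: Behavior) -> Behavior:
    """Conditional execution: run the behavior only if the condition holds."""
    def run(trace: List[str]) -> None:
        if condition():
            behavior(trace)
    return run


# Usage: a greeting composed from the constructs above.
user_present = True
greet = seq(
    act("turn_to_user"),
    choice(act("wave"), act("nod")),
    if_then(lambda: user_present, act("smile")),
)
trace: List[str] = []
greet(trace)
print(trace)
```

Composing behaviors as values, rather than writing each animation by hand, is what lets an author stay at the "high" level described above.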

The human interpretation process is very sensitive to and easily disturbed 
by a character’s “inconsistent” or “unnatural” behavior, whatever type of 
“nature” (realistic or not) is applicable. The challenge here is to maintain 
consistency between an agent’s internal emotional state and various forms of 
associated outward behavior such as speech and body movements [24]. An 
agent that speaks with a cheerful voice without displaying a happy facial 
expression will seem awkward or even fake. Another challenge is to keep con- 
sistency of agents over time, allowing for changes in their response tendencies 
as a result of the interaction history with other agents [46, 7]. 
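
One simple way to picture the first consistency requirement is to derive every output channel from a single internal emotional state and to check a rendered behavior against that state. The sketch below is an illustration only; the table `EXPRESSION_TABLE`, its channel names, and its values are assumptions and do not come from any of the cited systems.

```python
from typing import Dict, List

# Hypothetical table: one internal emotion drives every output channel.
EXPRESSION_TABLE: Dict[str, Dict[str, str]] = {
    "happy": {"voice": "cheerful", "face": "smile", "posture": "upright"},
    "sad": {"voice": "subdued", "face": "frown", "posture": "slumped"},
    "neutral": {"voice": "plain", "face": "relaxed", "posture": "upright"},
}


def render(emotion: str) -> Dict[str, str]:
    """Produce the channel settings that belong to one emotional state."""
    return dict(EXPRESSION_TABLE[emotion])


def inconsistent_channels(emotion: str, channels: Dict[str, str]) -> List[str]:
    """Return the names of channels that contradict the internal emotion."""
    expected = EXPRESSION_TABLE[emotion]
    return sorted(
        name for name, value in expected.items() if channels.get(name) != value
    )


# A cheerful voice without a happy facial expression is flagged.
consistent = render("happy")
awkward = {**consistent, "face": "relaxed"}
print(inconsistent_channels("happy", consistent))
print(inconsistent_channels("happy", awkward))
```

Keeping the emotion as the single source for all channels makes this kind of mismatch impossible by construction; the checker is only needed when channels are authored separately.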

Allbeck and Badler [1] developed an extensive framework for representing 
embodied characters and objects in virtual environments, called Parameter- 
ized Action Representation (PAR). This representation allows one to specify 
a large number of action parameters to control character behavior, including 
applicability conditions, purpose, duration, manner, and many more. Most 
notably, character actions can be modulated by specifying affect-related parameters,
emotion, and personality. In order to achieve a high level of naturalness
in expressive behaviors, the authors developed the EMOTE system
which is based on movement observation science. With respect to conversa- 
tional behavior, Cassell et al. [15] propose the BEAT (Behavior Expression 
Animation Toolkit) system as an elaborate mechanism to support consistency 
and accurate synchronization between a character’s speech and conversational 
gestures. The BEAT system uses a pipelined approach where the Text-to- 
Speech (TTS) engine produces a fixed timeline which constrains subsequently 
added gestures. The meaning of the input text is first analyzed semantically 
and then appropriate gestures are selected to co-occur with the spoken text. 
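
The pipelined idea, in which a fixed speech timeline constrains the gestures added afterwards, can be sketched as follows. This is not the BEAT toolkit or its API: `fake_tts_timeline` is a stand-in that assigns every word the same duration, and the word-to-gesture table replaces BEAT's linguistic analysis with a plain lookup. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WordTiming:
    word: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class GestureCue:
    gesture: str
    start: float
    end: float


def fake_tts_timeline(text: str, seconds_per_word: float = 0.5) -> List[WordTiming]:
    """Stand-in for a TTS engine: every word gets the same fixed duration."""
    return [
        WordTiming(word, i * seconds_per_word, (i + 1) * seconds_per_word)
        for i, word in enumerate(text.split())
    ]


def schedule_gestures(
    timeline: List[WordTiming], gesture_for_word: Dict[str, str]
) -> List[GestureCue]:
    """Attach gestures to words; timing is dictated by the speech timeline."""
    cues = []
    for timing in timeline:
        key = timing.word.lower().strip(".,!?")
        gesture = gesture_for_word.get(key)
        if gesture is not None:
            cues.append(GestureCue(gesture, timing.start, timing.end))
    return cues


timeline = fake_tts_timeline("This phone is really small.")
cues = schedule_gestures(timeline, {"this": "point", "small": "pinch"})
for cue in cues:
    print(cue)
```

Because the timeline is produced first and never modified, a gesture can only be placed where the speech allows it, which is the constraint described above.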

Most approaches to scripting virtual environments focus on designing the 
characters themselves and interactions between characters and virtual objects, 
with rudimentary consideration of the representation of interactions among 
characters and the user. The motivation for the Affective Presentation Markup 
Language (APML) developed by De Carolis et al. [18] is communicative func- 
tions, which make the language similar to the BEAT system [15]. In addition 
to turn-taking behavior, APML includes the speaker’s belief state (certainty of 
utterance) and intention (request, inform). The work of Mateas and Stern [38] 
broadens the spectrum of character scripting to interactive scenario scripting 
to also include another agent and a human user. The authors propose ABL
(A Behavior Language), a language that allows one to author believable char- 
acters for interactive drama. Unlike most other scripting approaches, which 
are XML-based [15, 18], ABL is a reactive planning language with charac- 
ter behaviors written in a Java-style syntax. Most notably, ABL may encode 
“joint plans” that describe the coordinated behavior of characters as one en- 
tity rather than having autonomous characters work out a joint plan (which 
would require complex reasoning, message passing, and so forth). However, 
joint plans are still reactive, letting the user interfere with plan execution 
during interaction. 

The next step in providing support for creating life-like character applications
for non-specialists is character toolkits that address the needs of content
providers. The Multi-modal Presentation Markup Language (MPML),
for instance, has been designed so that ordinary people can write multi-modal
character content as easily as they write web content using
HTML. Moreover, MPML offers a visual editor that allows one to script
interactive multi-character presentations in a drag-and-drop fashion using a 
graphical representation of the presentation flow [48]. MPML also provides 
an interface to the Scripting Emotion-based Agent Minds (SCREAM) system 
that enables authors to specify the propositional attitudes and affect-related 
processes of a character’s (synthetic) brain [48]. While MPML typically uses 
the Microsoft Agent package to control animated characters [40], the Galatea 
software toolkit allows authors to personalize core features of a facial spoken 
dialogue agent [31]. Galatea consists of interfaced modules that are all modifiable:
speech synthesizer, speech recognizer, facial animation synthesizer, and
task dialogue manager. As described above, the BEAT system is a toolkit 
to synchronize analyzed speech automatically with non-verbal behaviors [15]. 
The toolkit is extensible, and new rules encoding linguistic and contextual 
analysis of textual input are easily added. 

Another challenge for character-based applications is to adequately ac- 
count for the user’s behavior, in particular the user’s affective state [44]. 
Marking up user input modalities rather than character (output) modalities 
is a hitherto entirely unexplored application of scripting technology. Marriott
and Beard [35] propose a “complete user interaction” paradigm which they 
call “Gestalt User Interface ... an interface that should be reactive to, and 
proactive of, the perceived desires of the user through emotion and gesture”.
User interaction modalities such as speech, facial expressions, and body ges- 
tures are analyzed and then transformed to an XML structure that can be 
“played back” by a VHML-based talking head or provide the conditions to 
decide on the desired character response. 

Rist [50] offers interesting reflections on scripting and specification lan- 
guages for life-like characters. He proposes objectives and desiderata for the 
design of character languages and discusses the state of current developments 
in view of the potential (and highly desirable) standardization of scripting 
languages. Rist also points out limitations of the present focus on XML-based 
languages and suggests drawing inspiration from the area of network protocols
in order to manage more complex and sophisticated character interactions.



4 Life-Like Character Applications and Systems 

Recent years have witnessed a considerable and growing interest in employing 
life-like characters for tasks that are typically performed by humans. In the 
following, we list some of the more prominent deployed character applications 
as well as systems in progress. Issues of designing life-like characters and 
lessons learned can also be found in Hayes-Roth [27].

Life-like characters are used 

• as (virtual) tutors and trainers in interactive learning environments [20, 
30, 26, 16, 37, 56], 

• as presenters and sales personas on the web and at information booths
[11, 4, 48, 51], 

• as actors for entertainment [52, 10, 43], 

• as communication partners in therapy [19, 34, 37], 

• as personal representatives in online communities and guidance systems 
[14, 55, 53], and 

• as information experts enhancing conventional web search engines [32]. 

One of the most successful application fields of life-like character technology 
is computer-based learning environments where embodied agents can perform 
in a variety of student-related roles, especially as tutors and trainers [20, 30, 
26, 16, 37, 56]. Marsella et al. [37] describe a Mission Rehearsal Exercise 
(MRE) system for training peacekeeping missions where a realistic virtual 
human acts as a sergeant in the role of a mentor or as a soldier in the role 
of a teammate. In order to support highly believable, responsive, and easily 
interpretable behavior, the authors base their characters on an architecture for 
task-oriented behavior (STEVE), rich models of (social) plan-based emotion 
processing (Emile), and emotion appraisal and coping behaviors (Carmen’s 
Bright IDEAS). The MRE system is currently one of the most impressive 
applications of life-like character technology. 

Another application field where life-like characters showed significant 
progress is character-based presentation, especially online sales [11, 4, 48, 51].
Starting with the PPP Persona, Rist et al. [51] developed a series of increas- 
ingly powerful character technologies for a wide variety of agent-agent and 
human-agent interaction scenarios, such as the AiA travel agent, the eShow- 
room, a RoboCup commentator system (Gerd & Matze), a negotiation dia- 
logue manager (Avatar Arena), the MIAU platform for interactive car sales, 
and the interactive CrossTalk installation featuring two presentation screens. 
The work on life-like characters done at DFKI [51] can be seen as the strongest 
and most comprehensive in the field. While being well motivated and based on psychological
and socio-psychological research, it offers powerful technologies for
every imaginable interaction mode with and among life-like characters. As pre- 
viously mentioned, Prendinger et al. [48] developed two scripting tools that 
focus on creating interactive presentations (MPML) and affect-driven charac- 
ters (SCREAM). Both technologies are designed for web-based applications 
that require multiple character interactions including communication with the 
user. The implementation of an interactive casino scenario demonstrates the 
power and flexibility of this approach. 

One of the most attractive application fields of life-like characters is the en- 
tertainment sector where characters perform as virtual actors [52, 10, 38, 43]. 
Paiva et al. [43] provide a useful classification of character control technologies 
for story and game applications, based on the autonomy dimension. Besides 
character-related autonomy - (partially) scripted, directed, role constrained,
and autonomous - the authors also propose a classification of a user’s control 
over characters, that is, puppet-like control, guidance, influence, and god-like
control. The suggested classification is exemplified by a series of installations: 
Tristao and Isolda, Papous, Teatrix, FantasyA, and SenToy. Burke [10] de- 
scribes a powerful architecture that meets the demands of life-like characters 
for entertainment systems. In particular, he proposes a prediction-based approach
that allows for new types of learning and adaptive characters. The
previously mentioned work of Mateas and Stern [38] implements Façade, a
real-time 3D interactive drama that demonstrates
the capabilities and promise of characters in entertainment systems.

Life-like characters will also play a major role as communication partners 
in therapeutic and medical applications [19, 34, 37]. For instance, Marsella 
et al. [37] propose a system called “Carmen’s Bright IDEAS” (CBI) where 
users are immersed in a story that features an animated clinical counsellor 
and another agent that receives help and is designed to have problems similar 
to the user who interacts with the CBI system. The user may influence the development
of the counselling session by selecting interface objects (“Thought
Balloons”) that match his or her current feeling most closely.

The great popularity of Internet-based and computer-mediated commu- 
nications raises the demand for life-like characters that function as personal 
representatives of users in online communities (for instance, chat systems) 
and guidance systems [14, 55, 53]. Sumi [53] developed the AgentSalon sys- 
tem where a visitor to an exhibition is equipped with a PalmGuide that hosts 
his or her personal agent which may migrate to a big display - then being 
visible as an embodied character - and start conversing with personal agents 
of other visitors. Since the agent stores a user’s personal interest profile, the 
conversation between the personal representatives can reveal shared interests 
and trigger a conversation between visitors. 
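
The matching step can be pictured as a comparison of stored interest profiles. The following sketch is only an illustration under simple assumptions: profiles are modelled as plain sets of keywords, and the function names and sample data are hypothetical rather than taken from AgentSalon.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_interests(profile_a: Set[str], profile_b: Set[str]) -> List[str]:
    """Topics that appear in both visitors' interest profiles."""
    return sorted(profile_a & profile_b)


def matching_pairs(
    profiles: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], List[str]]:
    """For every pair of visitors, list common topics; skip pairs with none."""
    matches = {}
    for a, b in combinations(sorted(profiles), 2):
        common = shared_interests(profiles[a], profiles[b])
        if common:
            matches[(a, b)] = common
    return matches


# Made-up profiles carried by three personal agents.
profiles = {
    "visitor_a": {"emotion models", "speech synthesis", "robotics"},
    "visitor_b": {"speech synthesis", "emotion models", "virtual reality"},
    "visitor_c": {"information retrieval"},
}
matches = matching_pairs(profiles)
print(matches)
```

The topics returned for a pair are the natural material for the agents' opening conversation on the shared display.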

A common and one of the most important activities on the web is the 
retrieval of relevant information. Life-like characters have recently also been 
successfully employed to add value to search engines. Kitamura [32] describes 
the Multiple Character Interface (MCI) system that aims at assisting users in
the information retrieval task. Two MCI-based prototype systems are a co- 
operative multi-agent system for information retrieval (Venus and Mars) and 
a competitive multi-agent system for information recommendation (Recom- 
mendation Battlers) [32]. 

The following system can be viewed as a feasibility study on the next gen- 
eration of natural language understanding systems, including entertainment 
and helper robots, tutoring, and virtual space navigation systems. Tanaka et 
al. [56] developed a system called “Kirai” which allows one to direct virtual 
characters in a 3D environment. Most notably, the system incorporates a nat- 
ural language recognition and understanding (NLU) component so that char- 
acters can be instructed to perform actions in virtual space via speech input. 
Speech analysis includes syntactic and semantic analysis, anaphora resolution, 
ellipsis handling, and a simple mechanism to eliminate the vagueness problem 
of natural language. 



5 Concluding Remarks 

In this introductory chapter, the state of the art of life-like character script- 
ing languages and applications has been briefly reviewed. While the future 
of embodied characters remains to be seen, the extensive research on charac- 
ter representation languages and scripting tools certainly indicates a growing 
demand for embodiments of the human-computer interface. The most convincing
evidence for the continued interest is the large number of deployed
and upcoming character applications in a wide variety of domains, from
learning and entertainment to online sales and medical advice.

Life-like character research lays the foundations of the social computing 
paradigm, where computers deliberately display social cues and trigger social 
reactions in users. In order to pass as genuine social actors, life-like characters 
will eventually also have to be equipped with means to recognize social and 
affective cues of users, a research topic which we hope to address in a future 
publication. Although we focused on animated characters here, many of the 
insights gained can be transferred to the physical siblings of animated charac- 
ters, namely robotic agents. Animated or robotic, the success of those agents 
will ultimately depend on whether they are life-like. 
Summary. Creating or adopting a representation of human actions or behaviors,
whether for simulations, web applications, tutoring agents, training scenarios, or
the numerous other uses for virtual agents, requires an examination of the features 
needed for your application. Often a balance must be struck between the control a 
user has over the virtual agents and the amount of intelligence or autonomy they 
possess. Likewise, representation level(s) must be determined: is a graphical level 
representation needed or is a higher artificial intelligence level more appropriate? In 
this chapter, we briefly discuss some of these options and present our Parameterized 
Action Representation (PAR). 



1 Introduction 

You are in a room looking out a window. You see the busy streets below: 
traffic flowing through the streets and pedestrians hurrying off to work. But 
something seems odd. All of the cars are moving in the same direction, and all 
the pedestrians are moving along the same path too. You turn to look at the 
room. It appears to be a nice conference room: wood paneling, industrial blue 
carpeting, a large wood table surrounded by chairs. In one corner of the room, 
you notice two people conversing. From the dialog, you understand that a boss 
is reprimanding an employee. What is strange are the postures and gestures. 
They resemble a parent scolding a child. 

Then you notice a pot of coffee on the table. You decide to go and get a 
cup. You begin walking in the direction of the coffee, but soon encounter a 
chair in your path. You try to maneuver around it, but have to stop abruptly 
to prevent crashing into it. You alter your direction, take a step and stop 
again. Frustration builds as you continue this process until you are standing 
in front of the coffee. You reach for the coffee, but your hand does not find 
the pot. Your hand only seems to move to and from a spot to the right of 
the pot. Your anger builds, but it is outweighed by your desire for coffee. You 
turn to your left, take a small step, and turn again to face the coffee pot. This
time when you reach for the pot your hand penetrates the pot without moving 
it. As your frustration reaches a new level, you notice a face reflected in the 
surface of the pot. The face is smiling broadly. You realize the face is yours. 
How could you be smiling when you feel so frustrated? Is this a nightmare?
No, this is a virtual environment with an inadequate action representation. 

The world is complex and difficult to represent. Throughout history, artists 
and writers have depicted worlds with varying levels of realism, but expecta- 
tions are different in virtual reality. People spend their lives learning what to 
expect of their environment in terms of both objects and other people. When 
these expectations are not met, they become frustrated or confused. Unless 
the environment is designed as a fantasy world, the same expectations hold in 
virtual reality environments. People expect recognized objects to behave as in 
the real world with included randomness where appropriate. They expect to 
be able to navigate through the world in a natural fashion. They expect other
inhabitants of the environment to behave naturally, and they expect natural
interactions with both objects and people in the world. When these expectations are not
met, participants lose their sense of immersion in the world. 

In order to create an interactive world that meets natural expectations, 
a substantial amount of computer software engineering is required: graphical 
depictions, motion models or generators, collision detection and avoidance, 
communication or synchronization channels, planning and navigation, cogni- 
tive modeling, psychosocial and physiological modeling and more, depending 
on the scenario. 

The construction of these components can be facilitated by creating or 
using an action representation. Over the last few years, many representations 
have been created. Some representations focus on conversation [13], others 
on networked simulations [34, 12], others on aspects of computer graphics 
and animation [29], and others on logic and planning [33]. There are now 
also meta-representations such as XML [30]. In this chapter, we will outline 
some things to consider when adopting an action representation. Then we will 
present a representation we developed, the Parameterized Action Represen- 
tation (PAR). 



2 Control vs. Autonomy 

Computer animation originated as key-frame animation. This provided an- 
imators with detailed control over the movements of the characters. Unfor- 
tunately, it is also a time consuming process that requires the storage of a 
large amount of data, and is often specific to a character. Additionally, when 
motions for virtual characters are specified on a frame-by-frame basis, they 
then become dependent on the context for which they were designed. 

There are a number of problems that arise when motions cannot be altered 
to context. First, interactions with objects and other agents become difficult, 
if not impossible. When motions are specified as a combination of joint angles,
the relative locations of objects and agents must be precise in order, for
example, for an agent to reach an object. Additionally, the object size must 
be known a priori in order for the agent to grasp the object accurately. 

When only joint angle information is available for a motion, it is also dif- 
ficult to create transitions between motions. Several interpolation techniques 
have been developed, but more sophisticated methods are needed for tran- 
sitions between actions with severely differing postures, such as sitting and 
standing. Using joint angles as a representation implies knowing the beginning 
position of all the joints and being able to transition to them. 

Finally, expressivity is also a part of context. One of the advantages of 
using joint angle data is the control the animator has over the motion. A 
skilled animator can depict the character’s inner state (e.g. psychological and 
physiological state). However, when trying to use joint angle data in varying 
contexts, it is difficult to alter the expression of the motion to new contexts. 
Naturally, motion data could be stored for many expressive states, but it 
requires a lot of storage space and some mechanism for recalling the motion 
file required for a given context. 

Over the years, techniques have been developed that decrease the data 
needed for actions, enable context-sensitive actions, and increase agent auton- 
omy. Tools such as inverse kinematics [22, 37] enabled agent motion generators 
to perform more accurately in varying contexts, in some sense enabling the 
agent to determine precisely how an action should be performed. The increase 
in autonomy and decrease in the need for data specificity came at the cost of 
control. 
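
As a rough illustration of how inverse kinematics lets a motion generator adapt a reach to context, the sketch below solves a planar two-link arm analytically. The link lengths, the function names, and the planar simplification are assumptions made for illustration only; they are not taken from the systems cited above.

```python
import math


def two_link_ik(x, y, l1=0.3, l2=0.25):
    """Return (shoulder, elbow) angles in radians that place the hand at (x, y).

    Hypothetical planar arm: l1 and l2 are upper-arm and forearm lengths.
    Returns None when the target cannot be reached.
    """
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        return None  # target lies outside the reachable region
    elbow = math.acos(cos_elbow)  # one of the two mirror-image solutions
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def forward(shoulder, elbow, l1=0.3, l2=0.25):
    """Forward kinematics, used here only to check the solution."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y
```

Only the target position is supplied; the joint angles are computed when the action is performed, so the same reach works for objects placed anywhere within range.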

Another cost was naturalness or realism. Skilled animators could create 
motions that were both realistic and expressive. The newer techniques resulted 
in robotic motions. Luckily, these techniques emerged at a time when the
appearance of the characters was also unnatural. The motion quality of the 
characters met expectations given their appearance, but as graphics hardware 
advanced so did the characters’ appearances and the expectations of motion 
quality. 

Motion capture again increased the quality of motions, but at the cost 
of data size and context sensitivity. Today motion capture remains the best 
method for achieving natural human animations. Techniques have been and 
are being developed to make motion capture data more pliable and more 
context sensitive [9, 8, 20, 24, 23]. 

There are proprietary representations for information at this level (e.g. Jack 
[16] and DI-Guy [11]), but there are also pseudo standards, such as MPEG-4 
and Biovision (BVH). As a base-level representation such data may forever 
have its place. However, it will need to be combined with higher level data in 
order to provide proper transitions, added expressivity, and planning. 
3 AI-Level Representation

We have discussed some aspects of low-level motion representations. Now we 
turn to high-level representations. Low-level representations are used to de- 
scribe the movement of the characters. High-level representations can vary 
in their purpose and therefore their semantics. Much work has been done 
in representations for communicative or conversational agents [13], including 
some described in this volume. Their representations include mechanisms to 
synchronize facial expressions with speech. Some systems using these repre- 
sentations even extract semantic information from text to drive the display of 
the character and plan dialog [14]. These representations and systems gener- 
ally concentrate on agents interacting with a live participant and not a virtual 
world. 

Other autonomous agents are created to perform autonomously in a vir- 
tual world. Research in this area concentrates less on dialog and natural verbal 
communication and more on an agent’s interactions and autonomy in a virtual 
world. These are the types of agents we will focus on for the remainder of this 
chapter, but note that there is nothing to preclude using more than one rep- 
resentation in an agent or merging representations in order to create an agent 
with the ability to converse and behave autonomously in an environment. 

Planning for characters in virtual environments comes in a variety of forms. 
Reach planning is determining a path for an end-effector from a start position 
to a goal position, sometimes through a confined area [25]. Path planning and 
navigation involve determining a path for a virtual character to maneuver 
from one position to another in an environment [27, 32]. AI-level planning is
determining what action a character should perform next in order to obtain a 
predefined goal [33]. All levels of planning require some representation of the 
state of the environment and become more challenging when the environment 
is dynamic. Hence, in order to take full advantage of an action representation, 
objects must also be represented. In fact, object representations may include 
varying levels of data just as action representations. The position and orienta- 
tion of an object may be updated at every frame, but the object representation 
may also include data about its utilities, such as a door being a pathway, a 
knife being a cutting tool, or a car being a mode of transportation. This high-
level information can then be used in AI-level planning. Of particular interest
in this level of planning are the effects of an action on objects. For example, 
representing that when an object is picked up, it is in the possession of the 
agent who picked it up and that there are implications about the global posi- 
tion of this object when the agent performs a subsequent translatory action. 
Such reasoning can be done when an action representation includes an object 
representation. 
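
The pick-up example above can be sketched in a few lines. The class and field names here (`Thing`, `Agent`, `utilities`, `held_by`) are hypothetical and chosen only to mirror the prose; they do not reproduce any particular system's object representation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Thing:
    name: str
    position: tuple  # low-level data, may change every frame
    utilities: set = field(default_factory=set)  # high-level data, e.g. {"cutting tool"}
    held_by: Optional["Agent"] = field(default=None, repr=False, compare=False)


@dataclass
class Agent(Thing):
    holding: list = field(default_factory=list)

    def pick_up(self, obj):
        # Effect of the action: the object is now in the agent's possession.
        obj.held_by = self
        obj.position = self.position
        self.holding.append(obj)

    def move_to(self, position):
        # A translatory action also changes the global position of held objects.
        self.position = position
        for obj in self.holding:
            obj.position = position
```

A planner that knows these effects can infer, without any further sensing, that a knife picked up in the kitchen is in the dining room once the agent has walked there.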

The agents themselves may be considered special types of objects and also 
have a representation. As such they would have all the same fields as objects, 
but have a few additional entries. This opens a relatively new area in mod- 
eling virtual agents: cognitive and social modeling [38]. Agents’ personalities, 
emotional states, goals, motivations, and more can be stored in an agent representation.
Thus, in addition to the agent’s next action being planned, its next
goal may be determined, the expression or motion quality of its actions may 
be chosen, and cooperation or coordination between agents may be enacted. 



4 Network Simulations 

When creating a framework for distributed or networked simulations, many 
design dimensions must be considered: bandwidth, synchronization, agent au- 
tonomy, agent control, latency, visualization, and interfaces [34]. Often these 
considerations are diametrically opposed. We must balance, for example, the 
amount of control that we have over agents in the simulation with the amount 
of bandwidth that we have to control them. Early networked simulations 
broadcast position and orientation data over the network to every agent at 
every frame or clock tick. This enormous amount of information overwhelmed 
the available bandwidth, but gave the simulation designers great control. Later,
predictive methods such as dead reckoning were used to limit the packet frequency
requirements and thus better utilize bandwidth, but at the expense
of accuracy. Advancements in networking techniques and hardware have in- 
creased bandwidth, but our expectations of simulations have also increased. 
Emerging research techniques from computer graphics and AI can be applied 
to building a smarter framework for distributed simulations. 
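
To make the dead reckoning trade-off concrete, here is a minimal sketch assuming constant-velocity extrapolation and a fixed error threshold. The function names and the threshold policy are illustrative assumptions, not a description of any specific simulation protocol.

```python
import math


def extrapolate(position, velocity, elapsed):
    """Predict where an agent is `elapsed` seconds after its last update."""
    return tuple(p + v * elapsed for p, v in zip(position, velocity))


def packets_needed(samples, dt, threshold):
    """Return the ticks at which the sender must transmit an update.

    samples: one (position, velocity) pair per tick, the sender's true state.
    An update is sent only when the receivers' extrapolated position would
    drift more than `threshold` from the true position.
    """
    sent = [0]  # the initial state is always transmitted
    last_pos, last_vel = samples[0]
    last_tick = 0
    for tick in range(1, len(samples)):
        true_pos, true_vel = samples[tick]
        predicted = extrapolate(last_pos, last_vel, (tick - last_tick) * dt)
        if math.dist(true_pos, predicted) > threshold:
            sent.append(tick)
            last_pos, last_vel, last_tick = true_pos, true_vel, tick
    return sent
```

An agent that walks in a straight line and then turns once needs only two packets under this scheme, instead of one per tick; a larger threshold saves more bandwidth but tolerates a larger positional error between updates.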

Whether using a client-server or peer-to-peer architecture, packets de- 
scribing agent actions must be formulated, sent, received, and interpreted. 
Until recently, animation interpreters had to be simplistic: consider, for ex- 
ample, the limited state control afforded to DI-Guy from Boston Dynamics. 
The autonomy of agents was limited - mostly to repetitive actions such as 
walking, running, or crawling - or motion-captured units such as firing a 
weapon or falling. In either case motion control had to be explicit and fine- 
grained. Computational techniques, however, have advanced such that agents 
are not only acquiring the ability to perform individual actions on their own; 
they are also able to perform a series of contextually variable actions or be- 
haviors autonomously. Such actions may include reaching for objects, moving 
head and eyes to attend to interesting nearby events, and adjusting locomo- 
tion to avoid obstacles and shift gaits as needed [3]. The goal is to develop 
action representations that can be exploited by the underlying network to reduce
communication and, at the same time, guarantee a consistent world state
among distributed hosts. 

Increasing the autonomy of agents can result in a decrease in necessary 
bandwidth. Consider, for example, sending an agent’s frame-by-frame joint 
angles for all of the actions necessary to animate an agent entering a build- 
ing versus simply sending it a string: enter the building. Naturally with the 
detailed method the simulation has fine-grained control over the agent’s per- 
formance, while with the simple instruction the simulation appears to have 
little control. That is not the case, however. If the situation requires that the
agent enter the building carefully, through the blue door, or while watching the
window above it, there is no simple method to modify the detailed joint or 
motion capture data. If the actions are suitably parameterized, such modifiers 
may be carried immediately in the instruction itself and interpreted locally 
by the agent. Instructions between people carry information that both parties 
can use to drive implementing behaviors. Moreover, poor instructions may 
result in misunderstanding and incorrect actions. Simulations based at the in- 
struction level may help expose potentially negative communication practices 
during a training session. 
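
The bandwidth argument can be illustrated with a back-of-the-envelope comparison. The JSON message layout, the figure of 40 joints, and the 300-frame duration below are assumptions chosen for illustration; they are not a documented packet format.

```python
import json
import struct


def joint_angle_stream_bytes(num_joints, frames):
    """Bytes needed to send one 32-bit float per joint for every frame."""
    bytes_per_frame = struct.calcsize(f"<{num_joints}f")
    return bytes_per_frame * frames


def instruction_packet(action, agent, **modifiers):
    """Encode a parameterized instruction; modifiers travel in the same message."""
    message = {"action": action, "agent": agent}
    message.update(modifiers)
    return json.dumps(message).encode("utf-8")


packet = instruction_packet(
    "enter", "agent7",
    object="building", path="blue door", manner="carefully",
)
```

Under these assumptions the instruction occupies on the order of a hundred bytes, while ten seconds of joint angles at 30 frames per second for 40 joints takes 48,000 bytes, and changing "carefully" to another manner costs a few bytes rather than a new motion file.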

We are not arguing that natural language instructions should be used as 
the basis of a simulation packet structure; that would still require too much 
processing capability and interpretation in an agent, and the state of natural 
language processing is not quite ready for that role. But we can learn from 
this form of human-level communication some attributes that an efficient and 
effective distributed simulation packet structure might contain. Over the last 
few years, we have been developing a Parameterized Action Representation 
(PAR) based jointly on the information requirements necessary to animate an 
embodied computer graphics agent as well as to represent the semantics of 
natural language action verbs, adverbs, and prepositions [5]. 



5 Parameterized Action Representation 

Virtual humans can represent other people or function as autonomous helpers, 
teammates, or tutors enabling novel interactive educational and training ap- 
plications. We should be able to interact and communicate with them through 
modalities we already use, such as language, facial expressions, and gesture. 
This section describes our Parameterized Action Representation (PAR), which 
addresses many of the issues with action representations that we have out- 
lined. 

PAR allows an agent to act, plan, and reason about its actions or actions 
of others. Besides embodying the semantics of human action, PAR is designed 
for building future behaviors into autonomous agents and controlling the ani- 
mation parameters that portray personality, mood, and affect in an embodied 
agent. 

We have constructed a PAR and a system (PARSYS) which uses PAR 
as a knowledge base and intermediary between natural language and anima- 
tion [1, 3, 5, 10]. The PAR parameterization was created out of information 
from computer graphics and animation, natural language processing, and hu- 
man movement observation science. Although the emphasis of our research 
has been on the representation and processing of actions, objects are also 
represented in our formalism. 

As a representation for actions as instructions for an agent, PAR has to 
specify (parameterize) the agent, any relevant objects, and information about 
paths, locations, manners, and purposes. Below, Table 1 shows the
highest-level representation of actions and Table 2 that of objects.

5.1 PAR Architecture 

For this discussion, it is not necessary to describe the details of the PARSYS 
architecture. It will, however, be helpful to know its general concepts. A highly 
simplified diagram of the PARSYS architecture is shown in Fig. 1. 




Fig. 1. A simplified diagram of the PARSYS architecture.

The Actionary™ stores uninstantiated PARs (UPARs). UPARs contain 
the general semantics of an action, but lack specific parameter instantia- 
tion, such as the particular object involved in the action. Instantiated PARs 
(IPARs) are created in Agent Processes. Each character in the virtual world 
has an associated agent process that acts as the brain of the agent. This com- 
ponent is where emotion and personality factors influence the agent’s goals 
and where actions are chosen in pursuit of those goals. Action choice is also 
mitigated by the current state of the world as represented by the World Model.
Once an action is chosen, its UPAR is retrieved from the Actionary and
instantiated with specific objects, locations, and manners. Action selection can
actually be quite complex and may involve querying the Actionary for actions 
that meet certain conditions instead of a specific action by name. 
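
The relationship between uninstantiated and instantiated PARs might be sketched as follows. The dictionary-based Actionary, the `required` field, and the `instantiate` function are hypothetical simplifications for illustration and do not reflect the actual PARSYS data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UPAR:
    """General semantics of an action, with no parameters bound yet."""
    name: str
    required: tuple  # parameter names that must be bound on instantiation


@dataclass
class IPAR:
    """A UPAR bound to specific agents, objects, locations, and manners."""
    upar: UPAR
    bindings: dict


# Hypothetical miniature Actionary holding two uninstantiated actions.
ACTIONARY = {
    "pick up": UPAR("pick up", ("agent", "object")),
    "walk": UPAR("walk", ("agent", "destination")),
}


def instantiate(action_name, **bindings):
    """Retrieve a UPAR by name and bind its parameters, as an agent process might."""
    upar = ACTIONARY[action_name]
    missing = [p for p in upar.required if p not in bindings]
    if missing:
        raise ValueError(f"unbound parameters for {action_name!r}: {missing}")
    return IPAR(upar, dict(bindings))
```

For example, `instantiate("pick up", agent="Ann", object="cup", manner="carefully")` yields an IPAR that a motion generator could act on, whereas omitting the object is rejected before any motion begins.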

Every action in the Actionary is associated with one or more Motion Gen- 
erators. Motion generators are pieces of code that perform the action on the 
agent in the virtual world. The motion generators are as important as the 
action representation and planning components of the PAR system. Some
motion generators simply replay stored joint angle data. Others can alter this 
data for context [8] or affect [15]. Still others are sophisticated procedural 
animation components [25, 2]. 

5.2 Action Representation 



Table 1. High-level action PAR

type parameterized action =
  (name:                       STRING;
   participants:               agent-and-objects;
   applicability conditions:   BOOLEAN-expression;
   preparatory specification:  sequence conditions-and-actions;
   termination conditions:     BOOLEAN-expression;
   post assertion:             STATEMENT;
   during conditions:          STATEMENT;
   purpose:                    purpose-specification;
   subactions:                 par-constraint-graph;
   parent action:              parameterized action;
   previous action:            parameterized action;
   concurrent action:          parameterized action;
   next action:                parameterized action;
   start:                      time-specification;
   duration:                   time-specification;
   priority:                   INTEGER;
   data:                       ANY-TYPE;
   kinematics:                 kinematics-specification;
   dynamics:                   dynamics-specification;
   manner:                     manner-specification;
   adverbs:                    sequence adverb-specification;
   failure:                    failure-data).



The action representation (Table 1) includes many of the features described
in the beginning sections of this chapter. There are fields for low-level
animation concepts, such as kinematics and dynamics, including the storage
of explicit path information. Through the motion generators, a particular
PAR could also just represent the replaying of a particular joint angle data 
file, but the advantage of PAR is that this file is now associated with the se- 
mantics of the action it represents; hence, it can be planned for and reasoned 
about. There are many fields in PAR which aid in the planning of actions and 
the execution of those actions. 

Participants can include either objects or other agents and are entities 
that are involved in the action or can be affected by it. For example, picking 
up a cup from a table involves the agent performing the action, the cup, and
the table. 

PAR can describe either a primitive or a complex action. Subactions con- 
tain the details of executing the action. If it is a primitive action, the under- 
lying motion generator for the action is directly invoked. A complex action 
can list a number of subactions that may need to be executed in sequence, 
in parallel, or as a combination of both. A complex action can be considered 
done if all of its subactions are done or if its explicit termination conditions 
are satisfied. 
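
The completion rule for primitive and complex actions can be sketched as below. The `Action` class, its `finished` flag, and the callable `termination` field are illustrative assumptions; in particular, the real subactions field is a constraint graph rather than a flat list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    name: str
    subactions: List["Action"] = field(default_factory=list)
    termination: Optional[Callable[[], bool]] = None
    finished: bool = False  # for primitives, set when the motion generator ends

    def is_done(self):
        # Explicit termination conditions end the action regardless of subactions.
        if self.termination is not None and self.termination():
            return True
        # A complex action is done when all of its subactions are done.
        if self.subactions:
            return all(sub.is_done() for sub in self.subactions)
        # A primitive action relies on its motion generator.
        return self.finished
```

Here a "pick up" action composed of "reach" and "grasp" reports completion only once both primitives have finished, unless its own termination condition fires first.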

The applicability conditions of an action specify what needs to be true in 
the world in order to carry out an action. These can refer to agent capabili- 
ties, object configurations, and other unchangeable or uncontrollable aspects 
of the environment. The conditions in this boolean expression must be true to 
perform the action. For example, walking actions have an applicability condi- 
tion stating that the agent performing the action must be capable of walking. 
When defining an agent representing an infant, it is not given walking as one 
of its capabilities. If a translatory action became a part of this agent’s plan, 
walking would not be considered, and another action such as crawling would 
be chosen. 
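
The infant example can be expressed as a small selection routine. The capability strings and the list-of-dictionaries layout are hypothetical; they stand in for whatever boolean expressions an implementation actually evaluates.

```python
# Each candidate action carries an applicability condition over the agent.
ACTIONS = [
    {"name": "walk", "applicable": lambda agent: "walking" in agent["capabilities"]},
    {"name": "crawl", "applicable": lambda agent: "crawling" in agent["capabilities"]},
]


def choose_translatory_action(agent, actions=ACTIONS):
    """Return the first action whose applicability condition holds, else None."""
    for action in actions:
        if action["applicable"](agent):
            return action["name"]
    return None


infant = {"capabilities": {"crawling"}}
adult = {"capabilities": {"walking", "crawling"}}
```

With these definitions the adult is assigned walking and the infant crawling, without the plan itself having to mention either agent's abilities.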

Preparatory specifications are a list of <CONDITION, action> state- 
ments. The conditions are evaluated first and have to be satisfied before the 
current action can proceed. If the conditions are not satisfied, then the cor- 
responding action is performed - it may be a single action or a very complex 
combination of actions, but it has the same format as the other PARs. Ac- 
tions can involve the full power of motion planning to determine, perhaps, 
that a handle has to be grasped before it can be turned. We presently specify 
the conditions to test for likely (but generalized) situations and execute ap- 
propriate intermediate actions. It is also possible to add more general action 
planners, since PAR represents goal states and supports a full graphical model 
of the current world state. 
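
A minimal sketch of how a list of <CONDITION, action> pairs might be processed follows. The world dictionary, the `perform` callback, and the handle example are assumptions for illustration; a real preparatory action would itself be a full PAR.

```python
def satisfy_preparatory_specs(world, specs, perform):
    """Process (condition, action) pairs before the main action proceeds.

    For each pair, the action is performed only if its condition does not
    already hold. Returns False if a condition still fails afterwards.
    """
    for condition, action in specs:
        if condition(world):
            continue  # nothing to prepare
        perform(world, action)
        if not condition(world):
            return False
    return True


def perform(world, action):
    """Hypothetical executor that records the action and applies its effect."""
    world["log"].append(action)
    if action == "grasp handle":
        world["holding_handle"] = True


# Before "turn handle" can proceed, the handle must be grasped.
turn_handle_specs = [(lambda w: w["holding_handle"], "grasp handle")]
```

Running the same specifications a second time performs no extra work, since the condition is already satisfied.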

Termination Conditions are a list of conditions which when satisfied in- 
dicate the completion of the action. Post Assertions are a list of statements 
or assertions that are executed after the termination conditions of the action 
have been satisfied. These assertions update the world model to record the 
changes in the environment. The changes may be due to direct or side effects 
of the action. 
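
The interplay of termination conditions and post assertions might look like the following loop. The function signature, the step limit, and the walking example are illustrative assumptions rather than the PARSYS execution model.

```python
def run_action(world, step, termination, post_assertions, max_steps=1000):
    """Advance an action until its termination condition holds.

    When the condition is satisfied, every post assertion is applied to the
    world model. Returns False if the step limit is reached first.
    """
    for _ in range(max_steps):
        if termination(world):
            for assertion in post_assertions:
                assertion(world)
            return True
        step(world)
    return False


def walk_step(world):
    world["x"] += 1  # one unit of progress per simulation step


def at_door(world):
    return world["x"] >= 3


def record_arrival(world):
    world["location"] = "door"  # direct effect recorded in the world model
```

Starting from `{"x": 0, "location": "hall"}`, the walk terminates after three steps and the world model then records the agent at the door.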

5.3 Object Representation 

The PAR object type (Table 2) is defined explicitly to represent a physical 
object and is stored hierarchically in the Actionary. When a virtual world is 
created, objects are retrieved from the Actionary, instantiated, and placed in 
the world model, where they are updated throughout the simulation. Each 
object in the environment is an instance of this type and is associated with 
a graphical model in a scene graph. An object type lists the actions that can
be performed on it and what state changes they cause. Among other fields,
a list of grasp sites and directions are defined with respect to the object.
Many of the details of the representation of an object can be filled in as the
simulation begins (e.g. calculation of the bounding volume). These fields help
orient actions that involve objects, such as grasping, reaching, and locomotion.

Table 2. High-level object PAR

type object representation =
  (name:                 STRING;
   is agent:             BOOLEAN;
   properties:           sequence property-specification;
   status:               status-specification;
   posture:              posture-specification;
   location:             object representation;
   contents:             sequence object representation;
   capabilities:         sequence parameterized action;
   relative directions:  sequence relative-direction-specification;
   special directions:   sequence special-direction-specification;
   sites:                sequence site-type-specification;
   bounding volume:      bounding-volume-specification;
   coordinate system:    site;
   position:             vector;
   velocity:             vector;
   acceleration:         vector;
   orientation:          vector;
   data:                 ANY-TYPE).

Agents are treated as special objects that can execute actions. Their prop- 
erties are also stored in the Actionary. Each agent is associated with an agent 
process, which controls its actions based on the personality and capabilities of 
the agent. Not only does an agent’s personality affect its response to a situation,
but it also affects the way these actions are performed. Two agents with
different personalities would execute the same action in two different ways. 
For example, two agents could be waving at one another. A shy agent would 
wave its hand more slowly and with more hesitation than an extroverted agent 
would. This increases believability by preventing agents from reacting in the 
same manner in identical contexts and gives the impression that each agent 
has distinct emotions and personality. 
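
One way to picture the waving example is a function that maps a personality trait to motion parameters. The linear mapping, the parameter names, and the numeric constants below are invented for illustration; the chapter does not specify how personality is translated into motion qualities.

```python
def wave_parameters(extroversion, base_speed=1.0, base_amplitude=0.4):
    """Derive hypothetical wave parameters from extroversion in [0, 1]."""
    e = min(max(extroversion, 0.0), 1.0)  # clamp to the valid range
    return {
        "speed": base_speed * (0.5 + e),          # shy agents wave more slowly
        "amplitude": base_amplitude * (0.5 + e),  # and with a smaller gesture
        "hesitation": 0.6 * (1.0 - e),            # seconds of delay before waving
    }


shy = wave_parameters(0.1)
outgoing = wave_parameters(0.9)
```

The same "wave" action thus produces visibly different motions for the two agents while its semantics stay unchanged.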



6 PAR for Agent Modeling 

Given that PAR can be used to animate embodied agents, even from natural 
language instructions, can it be used to generate more “character”-rich agents?
In this section, we will show that PAR adequately represents the components 
necessary for modeling embodied agents and further that it is compatible with
common methods for modeling emotion and personality in agents. 

In [19], Funge et al. described a hierarchy of computer graphics modeling. 
The bottom two layers depicted early computer graphics research in geomet- 
ric models and inverse kinematics. Physical models generate realistic motion 
through dynamic simulation. Behavioral modeling involves characters that 
perceive environmental stimuli and react appropriately. Through cognitive 
modeling, autonomous characters can be given goals and react deliberatively 
as well as reactively.

PAR and PARSYS accommodate and enable each level in this hierarchy. 
While the actual geometry is assumed to have been created before the simula- 
tion begins, PAR does represent and PARSYS automatically recognizes some 
vital geometric constructs. Bounding volumes, for example, can be calculated 
as soon as the geometry is loaded into the system. Spatial properties, such as 
location and containment, can also be recognized and stored. Updating and 
storing this information in a central location means that it does not have to be 
calculated by every object manipulator. Kinematics and dynamics are explic- 
itly represented in PAR. Furthermore, PAR has been tied to a fast, analytic 
inverse kinematics program [37] that facilitates the generalization of actions 
such as reaching. 

The behavioral component of embodied agency is at the foundation of 
PARSYS. The World Model is updated to provide the necessary processes,
agent processes and motion generators, with information on the current state 
of the environment. Currently, this resource is shared by all of the agents 
and motion generators. It is, however, possible to provide each agent process 
with its own world model so that it represents an agent’s own unique view of 
and beliefs about the state of the world. The embodied agents can be given 
goals directly in the form of PAR or through natural language instructions. 
An agent tries to complete its goals by performing actions. Reactivity to the 
environment takes place in two forms. First, the agent processes and motion 
generators have quick access to the current state of the environment through 
PAR allowing them to refine a motion or even terminate an action. Second, 
PAR contains information about failure states and PARSYS has the ability 
to detect failures and notify the agent process with the information necessary 
to handle the failure. In PARSYS failures are anything that causes a mo- 
tion generator to terminate before its termination conditions have been met. 
For example, a motion generator may check to ensure that the preparatory 
specifications of the action it is performing are maintained throughout. If the 
specifications are not maintained, a failure can be generated and returned to 
the agent process where a decision could be made to try to reestablish the 
specifications or abort the action. 
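
The failure path described above can be sketched with an exception that travels from the motion generator back to the agent process. The class and function names are hypothetical, and the abort-only policy is a simplification; an agent process could equally try to reestablish the specification and retry.

```python
class ActionFailure(Exception):
    """Raised when a motion generator stops before its termination conditions hold."""

    def __init__(self, action, reason):
        super().__init__(f"{action}: {reason}")
        self.action = action
        self.reason = reason


def motion_generator(action, frames, preparatory_holds):
    """Run for `frames` frames, checking the preparatory specification each frame."""
    for frame in range(frames):
        if not preparatory_holds(frame):
            raise ActionFailure(
                action, f"preparatory specification lost at frame {frame}"
            )
    return "completed"


def agent_process(action, frames, preparatory_holds):
    """Decide what to do when notified of a failure (here: simply abort)."""
    try:
        return motion_generator(action, frames, preparatory_holds)
    except ActionFailure as failure:
        return f"aborted ({failure.reason})"
```

If, for instance, the agent loses its grip on a handle at frame 4 of a turning motion, the agent process receives the reason along with the failure and can base its decision on it.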

The way in which an agent responds to changes in the environment, the 
way in which agents pursue their goals, and even which goals are most im- 
portant are aspects of cognitive modeling. PARSYS contains mechanisms for 
planning and also filtering and prioritizing the actions that the planner can 
plan with, thereby individualizing the agent. During the planning process,
the planner queries the Actionary for actions that match the conditions it is 
trying to meet. Before the satisfying actions are returned to the planner, an 
action filter removes any actions that the agents would not do in the current 
situation and prioritizes the remaining actions. For example, walking might 
be prioritized over running or skipping in the satisfaction of a locomotion 
condition because of either the nature of the agent (businessman or child) or 
sensitivity to motion goals or qualities (manner). 
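
The query-filter-prioritize step can be illustrated with a small Python sketch. The Agent and Action classes, the PREFERENCES table, and the function names are hypothetical and are not taken from PARSYS; the per-role rankings are invented purely to mirror the walking/running/skipping example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Agent:
    name: str
    role: str  # e.g. "businessman" or "child"


@dataclass
class Action:
    name: str
    satisfies: str  # condition the action can meet, e.g. "locomotion"


# Hypothetical per-role preferences: a lower rank means a higher priority.
# Actions missing from a role's table are filtered out entirely.
PREFERENCES: Dict[str, Dict[str, int]] = {
    "businessman": {"walk": 0, "run": 1},
    "child": {"skip": 0, "run": 1, "walk": 2},
}


def query_actionary(actionary: List[Action], condition: str) -> List[Action]:
    """Return all actions that could satisfy the planner's condition."""
    return [a for a in actionary if a.satisfies == condition]


def action_filter(agent: Agent, candidates: List[Action]) -> List[Action]:
    """Drop actions the agent would not perform, then order the rest."""
    prefs = PREFERENCES.get(agent.role, {})
    allowed = [a for a in candidates if a.name in prefs]
    return sorted(allowed, key=lambda a: prefs[a.name])
```

With these assumed preferences, a businessman agent asking for a locomotion action is offered walking before running and never skipping, while a child agent is offered skipping first.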



6.1 Personality and Emotions 

The actions of the action filter may be dependent on any aspect of the agent, 
including its personality or current emotion level. Two popular models for 
personality and emotion are OCEAN [39] and OCC [28] respectively. 

Personality is a pattern of behavioral, temperamental, emotional, and men- 
tal traits that distinguish people from one another. Traits are basic tendencies 
that remain stable across the life span, but characteristic behavior can change 
through adaptive processes. The ways in which a person perceives, acts, and
reacts are influenced by his or her personality. While there is no universally
accepted theory, the Big Five or OCEAN model has gained some acceptance [39].
The “Big Five” represent a taxonomy of traits that some personality psychol- 
ogists suggest capture the essence of individual differences in personality. The 
traits of the Big Five model are shown in Table 3. 

Openness means a person is imaginative, independent-minded, and has 
divergent thinking. Openness to experience describes the breadth, depth, 
originality, and complexity of an individual’s mental and experiential life. 
Conscientiousness means a person is responsible, orderly, and dependable. 
Conscientiousness describes socially prescribed impulse control that facili- 
tates task-directed and goal-directed behavior, such as thinking before acting, 
delaying gratification, following norms and rules, and planning, organizing, 
and prioritizing tasks. Extraversion means that a person is talkative, social,
and assertive. It implies an energetic approach to the social and material 
world and includes traits such as sociability, activity, assertiveness, and posi- 
tive emotionality. Agreeableness means a person is good natured, cooperative, 
and trusting. Agreeableness contrasts a pro-social and communal orientation 
toward others with antagonism and includes traits such as altruism, tender- 
mindedness, trust, and modesty. Neuroticism means a person is anxious, prone 
to depression, and worries a lot. It contrasts emotional stability and even- 
temperedness with negative emotionality, such as feeling anxious, nervous, 
sad, and tense. 

One of the most popular models for emotion is the OCC model, named 
after its authors [28]. In this model, emotions are generated through the 
agent’s construal of and reaction to the consequence of events, actions of 
agents, and aspects of objects. Many researchers have based their work on 
this model [17, 7, 21]. 

Table 3. OCEAN model of personality

                    High Score Traits                 Low Score Traits

Openness            Creative, Curious, Complex        Conventional, Uncreative

Conscientiousness   Reliable, Well-organized,         Disorganized, Undependable
                    Self-disciplined, Careful

Extraversion        Sociable, Friendly, Fun-loving    Introverted, Reserved, Quiet

Agreeableness       Good natured, Sympathetic,        Critical, Rude, Harsh, Callous
                    Forgiving, Courteous

Neuroticism         Nervous, Insecure, Worrying       Calm, Relaxed, Secure


Table 4 shows part of the PAR representation for agents. The parameters 
of the OCEAN model are represented as values along the scales of each of 
the characteristics. More information is needed to implement the OCC
model. First, the standards and values of the agent must be represented.
These can be represented as statements that contain PAR actions. Essentially,
each action can be associated with a number corresponding to the agent’s
opinion of that action. Agents or classes of agents can also be associated with
the actions to create more specific standards. Goals are actions with high 
priorities. Agents and objects can be tagged with information representing 
the agent’s degree of cognitive unity and liking of the object. 



Table 4. Partial PAR agent representation 

type parameterized agent =
  (name: STRING;
   personality: OCEAN-parameter-spec;
       Openness INTEGER;
       Conscientiousness INTEGER;
       Extraversion INTEGER;
       Agreeableness INTEGER;
       Neuroticism INTEGER;
   emotion: OCC-specification;
       standards: sequence STATEMENT;
       goals: sequence parameterized action;
       appraisals: sequence cogn-unit-specification;
                   sequence appraisal-specification;
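
For readers who prefer a concrete data structure, the partial agent representation of Table 4 might be mirrored in Python roughly as below. This is only a sketch under stated assumptions: the class names, the default values, and the use of (name, number) pairs for standards and appraisals are illustrative choices, and nesting standards, goals, and appraisals under the OCC specification is one reading of the table.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OceanSpec:
    # One value per OCEAN trait; the integer scale is not specified in the text.
    openness: int = 0
    conscientiousness: int = 0
    extraversion: int = 0
    agreeableness: int = 0
    neuroticism: int = 0


@dataclass
class OccSpec:
    # (action name, the agent's rating of that action); the pairing is an assumption
    standards: List[Tuple[str, int]] = field(default_factory=list)
    # names of high-priority actions
    goals: List[str] = field(default_factory=list)
    # (agent or object name, degree of liking or cognitive unity)
    appraisals: List[Tuple[str, int]] = field(default_factory=list)


@dataclass
class ParameterizedAgent:
    name: str
    personality: OceanSpec = field(default_factory=OceanSpec)
    emotion: OccSpec = field(default_factory=OccSpec)
```

Each agent gets its own lists through default_factory, so the standards recorded for one agent are never shared with another.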



6.2 EMOTE for Displaying Affect 

The implementation of personality or emotion for embodied characters must 
extend further than decision making or action selection. The quality of move- 
ment in an action is also affected by personality and emotion. We have de- 
veloped a parameterized system for creating more expressive gestures. The 
EMOTE system [40, 41, 4, 15] is based on movement observation science.
Laban Movement Analysis (LMA) is a method for observing, describing, no- 
tating, and interpreting human movement. Two of LMA’s components are 
Effort and Shape. Effort involves the dynamic qualities of movement. Shape 
describes the changing forms that the body makes in space. Effort comprises 
four motion factors: Space, Weight, Time, and Flow. Each motion factor is a 
continuum between two extremes: indulging in the quality or fighting against 
the quality. Table 5 describes the Effort qualities. Shape changes in move- 
ment can be described in terms of three dimensions: horizontal, vertical, and 
sagittal. 



Table 5. Effort and Shape elements

Space: attention to the surroundings
  Indirect: flexible, meandering, wandering, multi-focus
    Examples: waving away bugs, slashing through plant growth
  Direct: single focus, channeled, undeviating
    Examples: pointing to a particular spot, threading a needle

Weight: sense of the impact of one’s movement
  Light: buoyant, delicate, easily overcoming gravity, marked by decreasing pressure
    Examples: dabbing paint on a canvas, describing the movement of a feather
  Strong: powerful, having an impact, increasing pressure into the movement
    Examples: punching, pushing a heavy object, expressing a firmly held opinion

Time: lack or sense of urgency
  Sustained: lingering, leisurely, indulging in time
    Examples: stretching to yawn, stroking a pet
  Sudden: hurried, urgent
    Examples: swatting a fly, grabbing a child from the path of danger

Flow: attitude toward bodily tension and control
  Free: uncontrolled, abandoned, unable to stop in the course of the movement
    Examples: waving wildly, shaking off water
  Bound: controlled, restrained, able to stop
    Examples: moving in slow motion, tai chi, carefully carrying a cup of hot liquid

Horizontal
  Spreading: affinity with Indirect
  Enclosing: affinity with Direct

Vertical
  Rising: affinity with Light
  Sinking: affinity with Strong

Sagittal
  Advancing: affinity with Sustained
  Retreating: affinity with Sudden

We have created many demonstrations of the EMOTE parameters. One
such demonstration involved a virtual character hitting and touching a balloon 
[18] (see Fig. 2). Here the same basic animation data (from motion capture) 
for hitting was altered by the EMOTE system generating several different 
types of hitting and even touching. 




Fig. 2. EMOTE alterations of hitting a balloon. The bottom left panel indicates 
very little force is applied to the balloon. The bottom right panel shows a much 
greater force. Both were created from the same key frame data, but with differing 
EMOTE settings 



It is our goal to formally link these EMOTE parameters with OCEAN
and OCC parameterizations. Table 6 shows an initial linking of EMOTE and
OCEAN. This linkage is based on descriptions of LMA [6] and OCEAN [39]
and is included only as an example of the type of mappings needed. We plan
to verify or modify this linkage by showing agents exhibiting these qualities
to naive observers and having them complete a questionnaire about the
personality characteristics of the agent. We also plan to use a learning process
to build the mapping between OCC and EMOTE. Automatically acquiring
motion qualities from observation, and validating them to make sure they are
consistent with the LMA concepts and theories, is not only essential to
completing the EMOTE system in particular, but can also offer a
powerful and valuable methodological tool for analyzing gestures and helping
to create natural, personalized communicative agents. In [40] Zhao has
developed a neural network-based system to achieve this goal. The system
takes 3D motion capture as input and outputs a classification of the EMOTE
qualities that are detected in it. The networks are trained with professional
LMA notators to ensure valid analysis.

Table 6. Example EMOTE and OCEAN linkage

                          Space      Weight    Time        Flow

Openness           High   indirect   light     sustained   free
                   Low    direct     strong    sudden      bound

Conscientiousness  High   direct     strong    sudden      bound
                   Low    indirect   light     sustained   free

Extraversion       High   indirect   light     sustained   free
                   Low    direct     strong    sudden      bound

Agreeableness      High   indirect   light     sustained   free
                   Low    direct     strong    sudden      bound

Neuroticism        High   direct     strong    sudden      free
                   Low    indirect   light     sustained   bound
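
The example linkage in Table 6 can be encoded as a simple lookup, sketched below in Python. The dictionary layout and the function name effort_for are assumptions made for illustration; the entries are transcribed from Table 6, which the text presents only as an example mapping that has not yet been validated. The sketch relies on the observation that every low-score row in the table is the opposite pole of the corresponding high-score row.

```python
from typing import Dict

# Opposite poles of each Effort factor.
OPPOSITE: Dict[str, str] = {
    "indirect": "direct", "direct": "indirect",
    "light": "strong", "strong": "light",
    "sustained": "sudden", "sudden": "sustained",
    "free": "bound", "bound": "free",
}

# High-score rows of Table 6; low-score rows are derived as the opposite poles.
HIGH: Dict[str, Dict[str, str]] = {
    "openness": {"space": "indirect", "weight": "light",
                 "time": "sustained", "flow": "free"},
    "conscientiousness": {"space": "direct", "weight": "strong",
                          "time": "sudden", "flow": "bound"},
    "extraversion": {"space": "indirect", "weight": "light",
                     "time": "sustained", "flow": "free"},
    "agreeableness": {"space": "indirect", "weight": "light",
                      "time": "sustained", "flow": "free"},
    "neuroticism": {"space": "direct", "weight": "strong",
                    "time": "sudden", "flow": "free"},
}


def effort_for(trait: str, high: bool) -> Dict[str, str]:
    """Return the Effort qualities linked to a high or low score on a trait."""
    row = HIGH[trait.lower()]
    if high:
        return dict(row)
    return {factor: OPPOSITE[pole] for factor, pole in row.items()}
```

A lookup like this says nothing about how to combine several traits for one agent; that question is part of the verification work described above.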

Future work in the EMOTE system and the motion quality recognizer 
will be to train the system to correlate captured motions with actor affect, 
behavior, mood, and intent. The critical problem in such training is setting up 
appropriate situations that truly elicit affective responses in individuals. We 
believe that the key ingredients to successful data generation are immersive 
experiences with both live and virtual agents. Engaging with either or both 
real and virtual agents in the same circumstances will be crucial to evaluating 
effectiveness and calibrating responses across the reality/virtual divide. Using 
the motion capture and post-session analysis, ground truth information can 
be supplied for training sets. The neural network models may then connect 
motion qualities to expressed affect and mood. Although the LMA community
recognizes that such a mapping may exist, it has not yet been possible to
investigate it in a visually and computationally adequate environment.



7 Interfaces to Representations 

In the previous section describing PAR (Sect. 5), PARs were retrieved and 
instantiated by autonomous agent processes. This convention is convenient 
when explicit control over the characters is not required, or when it is possible
to code the agents’ planning to one’s specifications. Often, other control
mechanisms are desired. 

Basic scripting languages that specify that an agent is to perform a given
action at a given time can certainly be created. Richer scripting
constructs can be introduced when the virtual world creators are assumed to
be more knowledgeable [29].

Drag-and-drop creation applications for virtual environments are also be- 
ing created [35, 31]. In these applications, nodes represent agent functionality, 
such as walking. Nodes can then be linked in logical ways to create simulations. 

Another way of controlling an agent is through natural language. PAR 
was designed as an intermediary between natural language and computer an- 
imation, and as such is able to build agent behavior for virtual environments 
through natural language instructions [5, 10]. 



8 Conclusions and Future Research 

An action representation should afford both autonomy and control where 
appropriate. It should minimize data storage, while providing for expressive 
motions. It should provide semantics for planning from the motion level to 
the cognitive level. The PAR that we have presented in this chapter has made
strides toward meeting these requirements.

Another aspect of action representations that has not yet been included 
in PAR or the PAR system is level of detail. The level of detail of objects has 
long been a subject of graphics research. Large-scale, distributed simulations 
give us the opportunity to expand the level of detail concept to actions as 
well. Nearby actions involving objects may need to be enacted using inverse 
kinematics. At further distances, similar actions could be enacted by replay- 
ing motion capture data, because contact relations may not be noticeable. 
It is not even necessary to display actions that are outside an agent’s circle 
of influence [26, 36]. Nonetheless, other agents may still need to be aware of 
the action that an agent is performing, the consequences of the action, and 
whether or not the action was performed successfully. Thus PAR can be used 
to communicate agent activities even if those actions are not directly seen or 
even executed; PAR can be a cognitive representation for conveying action in- 
formation between agents. PAR might even be used to define “need-to-know” 
multi-cast groups. It uses fields that may be loaded, modified, interpreted, and 
transferred like data packets, and are generally used as dynamic information 
objects. An additional bonus with PAR is that its language origins allow its 
contents to be output as a sentence, making PAR a convenient resource for 
After Action Summaries and Reviews. 
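
The idea of level of detail for actions can be made concrete with a small Python sketch. Level of detail is not yet part of PAR or PARSYS, so everything here is hypothetical: the function name action_lod, the two distance thresholds, and the three labels are assumptions chosen to mirror the cases described above.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def action_lod(viewer: Point, actor: Point,
               near: float = 10.0, influence: float = 100.0) -> str:
    """Pick how an action is enacted based on its distance from the viewer.

    `near` and `influence` are illustrative thresholds, not values from the text.
    """
    distance = math.dist(viewer, actor)
    if distance <= near:
        # Contact relations are visible, so solve them explicitly.
        return "inverse_kinematics"
    if distance <= influence:
        # Contact errors are unlikely to be noticed at this range.
        return "motion_capture_replay"
    # Outside the circle of influence: nothing is displayed, but the PAR
    # can still be passed to other agents so they know what happened.
    return "par_only"
```

Under these assumed thresholds, an actor 5 units away is animated with inverse kinematics, one 50 units away replays motion capture, and one 500 units away is represented only by its PAR.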

PAR was designed to be a flexible representation, meaning that many 
different types of information can be represented. Not all of the fields of the 
PARs need to be filled in for every action. When considering a representation 
for use with embodied conversational agents we should consider the trade-offs
between parameterization specificity and program complexity. If you specify 
every joint angle for your character at every frame of the animation, your 
program needs only to display these angles on the figure. If you only specify 
that your agent needs to get some milk, then your program will need to figure 
out all the aspects of acquiring milk from high level planning to intricacies of 
movement. Our experience with PAR and PARSYS leads us to conclude that 
they have the right balance of specificity and complexity. 

That is not to say that there is not more work to be done. We would like 
to represent PAR in XML format so that it is more widely available to other 
researchers. Much work also needs to be done to establish the connection 
between EMOTE parameterization and models of personality and emotion. 
We are continuing to work on better planning and smarter motion generators 
for the PARSYS. Finally, although there is a natural language interface for 
PARSYS, conversation and dialog are not currently considered. A represen- 
tation and system for modeling conversation and its timing, such as BEAT 
[14], would certainly enhance our system. 


Summary. Life-like animated agents present a challenging ongoing agenda for research.
Such agent metaphors will only be widely applicable to online applications
when there is a standardized way to map underlying engines with the visual pre- 
sentation of the agents. This chapter delineates functions and specifications of two 
markup languages for scripting the animation of virtual characters. The first
is the Character Markup Language (CML), an XML-based language for defining
embodied agent character attributes and scripting animation, designed to
aid in the rapid incorporation of life-like agents into online applications or virtual
reality worlds. CML is based jointly on the motion and multi-modal capabilities
of virtual human figures. The other is the Avatar Markup Language (AML),
also an XML-based multi-modal scripting language, designed to be easily
understandable by human animators as well as easily generated by a software
process such as an agent. We illustrate the constructs of the two languages and look
at some examples of usage. The experience gained through the development of two 
such languages with different approaches yet similar aims highlights the need for a 
degree of unification. This is especially true given that a number of other similar 
languages exist as illustrated in other parts of this book. We attempt to define met- 
rics for comparison of a set of these languages with the aim of identifying salient 
constructs for a unified scripting language. 



1 Introduction 

This chapter presents an account of two approaches to specifying scripting
languages for character animation which are currently being developed and
evaluated at Imperial College London: the Character Markup Language (CML)
[1, 10] and the Avatar Markup Language (AML) [9, 10]. Each approach evolved
in the context of the projects within which it was developed. CML took a top-down
approach by defining high-level attributes for character personality, emotion, 
and behavior that are integrated to form the specification of synchronized an- 
imation script. New or unspecified behaviors are formed by blending together 







Yasmine Arafa, Kaveh Kamyab, and Ebrahim Mamdani 



base elements and attributes thereby providing animators with the flexibility 
to generate animation script as required. On the other hand, AML took a 
bottom-up approach in which the language provides a generic mechanism for 
the selection and synchronized merging of animations. In addition, AML pro- 
vides the flexibility for animators (human or non-human) to define higher level 
specifications based on the key elements provided plus any others that may be 
defined. The generic nature of AML implies that any software implementation 
supporting it will be fairly simple. 

At the time the CML and AML languages were being developed, parallel
attempts at similar languages were also underway. The appearance of these
has highlighted the need for powerful yet generic scripting languages to bridge 
the gap between behavior generation and animation tools. As a number of such 
scripting languages now exist, there appears to be a need for the research 
community to look at and agree upon the requirements of and expectations 
from them. 

This chapter describes the key features and capabilities CML and AML 
offer and discusses the technical issues they raise based on our design and
development experience on the ESPRIT project EP28831 MAPPA, IST project
IST-1999-10192 SoNG, and IST project IST-1999-11683 SAFIRA. The chapter
further sets forth the key functionality that such description and scripting
languages will need to succeed in animated agent interaction applications. 



2 Scripting with the Character Markup Language 

CML was developed with the aim of bridging the gap between the underlying 
affect and process engines, and agent animation tools. CML provides a map 
between these tools by automating the movement of information from XML 
Schema definitions into appropriate relational parameters required to gener- 
ate the intended animated behavior. This would allow developers to use CML 
as a glue-like mechanism to tie the various visual and underlying behavior 
generation tools together seamlessly, regardless of the platform that they run 
on and the language they are developed with. The term “character” is used 
to denote a language that encapsulates the attributes necessary for believable 
behavior. The intention is to provide for characters that are life-like but are
not necessarily human-like. Currently the attributes specified are mainly con- 
cerned with visual expression, although there is a limited set of specifications 
for speech. These attributes include specifications for animated face and body 
expression, behavior, personality, role, emotion, and gestures. 

2.1 Visual Behavior Definition 

Classification of behavior is governed by the actions an agent needs to perform 
in a session to achieve given tasks, and is influenced by the agent’s personality 
and current mental state. A third factor that governs character behavior is the 




Towards a Unified Scripting Language 






role the agent is given. Profiles of an agent’s personality and its role are
used to represent the ongoing influences on the agent’s behavior. These profiles
are user-specified and are defined using XML annotation. The behaviors
are defined as XML tags, which essentially group and annotate sets of action
points generally required by the intended behavioral action. The CML processor
will interpret these high-level behavior tags, map them to appropriate
action point parameters, and generate an animation script.

CML defines the syntactic, semantic, and pragmatic character presentation 
attributes using structured text based on the XML Schema definition. The 
character markup-based language extends the descriptions for facial expressions
used in the Facial Action Coding System (FACS). FACS defines
a set of all the facial movements performed by a human face [4]. Although
FACS is not an SGML-based language in nature, we use its notion of Action 
Units to manipulate expressions. Character gesture attribute definitions are 
based on the research and observations by McNeill [11] on human gestures 
and what they reveal. 

Affective expression is achieved by varying the extent and degree values 
of the low-level parameters to produce the required expression. The CML 
encoder will provide the high-level script to be used in order to specify the 
temporal variation of these facial expressions. This script will facilitate designing
a variety of time-varying facial expressions using the basic expressions
provided by the database.

2.2 Classification of Motion 

The conceptual architecture upon which the classification of motion is based 
is loosely derived from that defined by Blumberg and Russell’s research [3]. 
Blumberg and Russell’s architecture uses a three-layer structure comprising
geometry, motor, and behavior systems. We assume a motor generation
module which is responsible for the basic movements along with correlated 
transitional movements that may occur between them. Personified animation 
scripts are generated by blending the specification of different poses and ges- 
tures. The base motions are further classified by generic controls that are 
independent of the character itself. For example, a generic move motion can 
have different representations which are determined by the character emo- 
tional and/or personality attributes defined to represent nod, iconic gesture, 
head, hand or body gesture, walk, etc. Additionally, the language motion cate- 
gories should cater for the fact that behavior can be expressed through and can 
affect different parts of the character face (or body part). To realize the
parts of a character head/body skeleton which are to be affected while performing
a movement, CML divides the character element specifications into
four units: Head, Upper, Middle, and Lower parts. CML then provides the
specification of the constructs of each unit with varying granularity. 

Action composition script is generated by a CML processor (delineated in 
Fig. 1 below) which blends actions specified with an input emotion signal to 
select the appropriate gestures and achieve the expressive behavior. CML also
provisions for the generation of compound animation script by facilitating the 
definition and parameterization of sequences of base movements. 

The chosen base set of movements allows basic character control (move- 
ment and interactions) as well as assures the capability to perform unlimited 
character-specific animations. The interactions can involve other characters 
and objects that must be referenced by a valid id within the Graphics Engine. 

The initial set of CML base motions is classified by the goal of the motion 
as follows: 

• Movement defines motions that require the rotation or movement of a 
character from one position to another. Positions are defined by exact 
coordinates, an object position, or a character position. The CML elements 
defined for Movement are either move-to or turn-to. 

• Pointing defines a pointing gesture toward a coordinate, object, or char- 
acter. The CML element defined for this movement is point-to. 

• Grasping defines motions that require the character to hold, throw, or 
come in contact with an object or another character. The CML elements 
defined for Grasping are grasp, throw and touch. 

• Gaze defines the movements related to the head and eyes. The CML elements
defined for Gaze are gaze, track, blink, look-to, and look-at. The
gaze and track elements require that only the eyes be moved or track an
object or character; look-to and look-at require the movement of both head
and eyes.

• Gesture includes motions that represent known gestures like hand movements
to convey an acknowledgment, a wave, etc. The CML elements
defined for Gesture are gesture and gesture-at.
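
The classification above can be captured as a small lookup table, sketched here in Python. The element names are those listed in the text; the dictionary layout and the helper functions classify and gaze_parts are assumptions introduced for illustration and are not part of CML or its processor.

```python
from typing import Dict, Optional, Set, Tuple

# Base motion classes and the CML elements listed for each of them.
BASE_MOTIONS: Dict[str, Tuple[str, ...]] = {
    "Movement": ("move-to", "turn-to"),
    "Pointing": ("point-to",),
    "Grasping": ("grasp", "throw", "touch"),
    "Gaze": ("gaze", "track", "blink", "look-to", "look-at"),
    "Gesture": ("gesture", "gesture-at"),
}

# Reverse index: element name -> motion class.
ELEMENT_TO_CLASS: Dict[str, str] = {
    element: motion_class
    for motion_class, elements in BASE_MOTIONS.items()
    for element in elements
}


def classify(element: str) -> Optional[str]:
    """Return the base motion class of a CML element, or None if unknown."""
    return ELEMENT_TO_CLASS.get(element)


def gaze_parts(element: str) -> Optional[Set[str]]:
    """Return the body parts a Gaze element moves, where the text states them."""
    if element in ("gaze", "track"):
        return {"eyes"}
    if element in ("look-to", "look-at"):
        return {"head", "eyes"}
    return None  # the text does not say which parts other elements involve
```

A processor could use such an index to decide which motion generator should handle an incoming element.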

2.3 CML Specification 

CML defines a script like that used for a play. It describes the actions and se- 
quence of actions that will take place in a presentation system. The script is a 
collection of commands that tell the objects in the world what to do and how 
to perform actions. The language is used to create and manipulate objects 
that are held in memory and referenced by unique output-ontology objects. 
The structure of the language begins with a command keyword, which is usu- 
ally followed by one or more arguments and tags. An argument to a command 
usually qualifies a command, i.e. specifies what form of action the command is 
to take, while a tag is used to denote the position of other necessary informa- 
tion. A character expression markup module will add emotion-based markup 
resulting from emotional behavior generation rules to the CML descriptions. 

Animated character behavior is expressed through the interpretation of 
XML Schema structures. These structure definitions are stored in a Schema 
Document Type Definition (DTD) file using XSDL (XML Schema Definition 
Language). At run-time character behavior is generated by specifying XML 
tag/text streams which are then interpreted by the rendering system based
on the rules defined in the definition file. Its objective is to achieve a consis- 
tent convention for controlling character animation models using a standard 
scripting language that can be used in online applications. 

The language contains low-level tags defining specific character gesture 
representations defining movements, intensities, and explicit expressions. There 
are also high-level tags that can define commonly used combinations of these 
low-level tags. In the sections on CML Representation Language and CML 
Scripting Language we outline the base elements defined. We do not describe 
the syntax in too much detail here. Interested readers are advised to refer to
[1].

Synchronization between the audio and visual modalities is achieved 
through the use of SMIL (Synchronized Multimedia Integration Language) 
specifications [16]. SMIL defines an XML-based language that allows authors 
to write interactive multimedia presentations. Basically, CML uses the SMIL 
<par> and <seq> tags to specify the temporal behavior of the modalities being 
presented. The <seq> tag defines the order, start time, and duration of execu- 
tion of a sequence, whereas the <par> tag is used to specify that elements be 
played in parallel. For further flexibility, CML also provides order and time 
synchronization attributes. 
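
To illustrate the timing semantics of sequential and parallel containers, the following Python sketch computes start times for nested seq and par groups. It is not CML or a SMIL implementation: the classes Clip, Seq, and Par and the function schedule are hypothetical, each clip is assumed to have a fixed known duration, and a parallel group is assumed to end when its longest child ends.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Clip:
    name: str
    duration: float  # seconds


@dataclass
class Seq:
    children: List["Node"] = field(default_factory=list)


@dataclass
class Par:
    children: List["Node"] = field(default_factory=list)


Node = Union[Clip, Seq, Par]


def schedule(node: Node, start: float, out: Dict[str, float]) -> float:
    """Record each clip's start time in `out` and return the node's end time."""
    if isinstance(node, Clip):
        out[node.name] = start
        return start + node.duration
    if isinstance(node, Seq):
        # Children play one after another.
        t = start
        for child in node.children:
            t = schedule(child, t, out)
        return t
    # Par: all children start together; the group ends with the longest child.
    return max((schedule(c, start, out) for c in node.children), default=start)
```

For example, a sequence consisting of a parallel group (a 2 s speech clip and a 0.5 s nod) followed by a 1 s pointing gesture starts the speech and nod at time 0 and the pointing gesture at 2 s.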

2.4 CML Representation Language 

CML provides a set of base description/representation languages that are 
integrated with the face and body animation markup languages enabling these 
multi-modal features in a hybrid representation architecture. 

Head Gesture Taxonomy 

The gesture classifications here are defined as follows: 

• Symbolic gestures relate to universal symbols and commonly acknowledged 
gestures across cultures (e.g. repeated up and down nods are symbolic 
of agreement or shaking from side to side would usually be symbolic of 
disagreement). 

• Iconic gestures are used to demonstrate a symbolization of a particular 
behavior or action (e.g. rotating the head showing that one is dizzy). 

• Deictic gestures are gestures that point in the direction of objects. 

Hand Gestures Taxonomy 

CML captures gesture features such as postures of the hand (straight, relaxed, 
closed), its motion (moving, stopped), and its orientation. Over time, a stream 
of gestures is then abstracted into more general ‘gestlets’ (e.g. Point-at, sweep, 
end reference). Although the recorded action here takes place in the 2D plane,
similar phenomena play a role as in the case of 3D hand gesturing, but with
much simpler signal processing involved.

• Posture defines the position the hand is held in and a duration. 

• Motion defines whether the hand is in motion or stationary and a speed
defining the transition between each state.

• Orientation defines the direction of a movement (up, down, left, right, for- 
ward, backward). Directions are derived from the normal and longitudinal 
vectors of the palm. 

• ‘Gestlets’ define a set of high-level tags that are constituted from the lower
level tags described above. These make up gestures like Point, Wave, etc. 

• Fingers define high-level tags for each of the five fingers. 



Body Gestures Taxonomy 

The research for the base body gestures is partly derived from work conducted
on the research and analysis of body expressions.

• Natural defines the character’s default or normal posture state based on a 
distinct personality. 

• Relax defines a relaxed posture state of the character. 

• Tense defines a tensed posture state of the character. 

• Iconic gestures are used to demonstrate a symbolization of a particular 
behavior or action (e.g. rotating the head showing that one is dizzy). 

• Incline defines the orientation and degree a character might lean toward 
or against an object. 



Emotions 

The list below shows the specification for the attributes for an emotion. The 
attributes considered for the description of an emotion are based on the OCC 
theory of emotions [12], which was used to partially support the system im- 
plementation. 

• Class: The id of the emotion class being experienced.

• Valence: The basic types of emotional response (neutral, positive, or negative
value of the reaction).

• Subject: The id of the agent experiencing the emotion.

• Target: The id of the event/agent/object toward which the emotion is
directed.

• Intensity: The intensity of the emotion (a logarithmic scale between 0 and 10).

• Time-stamp: The moment in time when the emotion was felt.

• Origin: The id of the event/agent/object that caused a change in emotion.

The attribute Class describing an emotion refers to the type of that emotion.
An emotion type represents a family of related emotions differing in
terms of their intensity and manifestation, i.e. each emotion type can be realized
in a variety of related forms. For example, fear with varying degrees of
intensity can be seen as concern, fright, or being petrified.

The attribute Valence describes the value (positive or negative) for the 
reaction that originated the emotion. According to this theory, emotions are 
always a result of positive or negative reactions to events, agents, or objects. 
The Subject and Target attributes for emotions define the entities related to
them. The Subject defines the agent experiencing the emotion and the Target
defines the event, agent, or action that originated the emotion.
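
The attribute list above can be read as a simple record type. The following is a minimal sketch in Python, not part of the CML specification: the field names, the use of seconds for the time-stamp, and the validation rules are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Valence values named in the attribute list above.
VALENCES = ("neutral", "positive", "negative")


@dataclass(frozen=True)
class Emotion:
    """One emotion instance carrying the seven attributes listed above."""
    emotion_class: str  # Class: id of the emotion type, e.g. "fear"
    valence: str        # Valence: "neutral", "positive", or "negative"
    subject: str        # Subject: id of the agent experiencing the emotion
    target: str         # Target: id of the event/agent/object it is directed at
    intensity: float    # Intensity: value on the 0 to 10 scale
    timestamp: float    # Time-stamp: when the emotion was felt (seconds, assumed)
    origin: str         # Origin: id of the event/agent/object that caused it

    def __post_init__(self):
        # Reject values outside the ranges stated in the attribute list.
        if self.valence not in VALENCES:
            raise ValueError(f"unknown valence: {self.valence!r}")
        if not 0.0 <= self.intensity <= 10.0:
            raise ValueError("intensity must lie between 0 and 10")


# Hypothetical ids, chosen only for the example.
fear = Emotion("fear", "negative", "james", "customer1", 2.5, 12.0, "event42")
```

Keeping the record immutable is a design choice of this sketch: a change in emotion would then be represented by a new record with a new time-stamp and Origin rather than by editing an old one.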



2.5 CML Scripting Language

Face Animation Scripting

Character face description is a set of low-level tags based on MPEG-4 FAPs, 
and a set of high-level tags representing facial parts which are grouped from 
a set of respective low-level tags. CFML base elements are as follows: 

1. Head Movement 

CML defines two base head movements, tilt and turn. Complex
head movements are generated using a combination or sequence of the defined
base movements. The difference between the tilt and turn movements
is that tilt is a head movement in a slant with often subtle or superficial
neck movement, whereas turn would require more profound movement of
the neck. The following is the syntax of a turn movement (details of the
CML full syntax can be found in [4]):

<turn-to>
  <order {0 to n/before/after} />
  <priority 0 to n />
  <begin {ss:mmm/before/after/object} />
  <end {ss:mmm/before/after/object} />
  <speed {0.n to n.n (unit)/default/slow/fast} />
  <target {x,y,z/object/character} />
  <direction (rightside, leftside, frontside, backside) />
  <degree (n%) />
  <repeat {0 to n/dur} />
  <interrupt {yes/no} />
  <transAnimat {head groups} />
  <transPos {x,y,z/object/character} />
  <transSpeed {default/slow/intermediate/fast} />
</turn-to>



This code shows the basic syntax for the animation of a “turn-to” head
movement in CFML. The syntax includes parameters for synchronization,
pace, object, animation handling, and transition options. Synchronization
parameters are either absolute or relative to other animations or
sequences. Pace is the speed at which the movement is performed and
is usually governed by the predominant emotional state. Object defines
the properties of the target to turn to. These properties are also either
absolute or relative to another character or object or a general direction.
The animation handling parameters allow multiple repetitions
of the movement at specified intervals and specify whether or
not the animation can be interrupted. If the “interrupt” tag is set, then
the “transition option” tags specify the appropriate transition or fall-back
animations needed to sustain a smooth and believable animation. Modifiers
to a typical “turn-to” movement that reflect emotional state and
personality behaviors are achieved by encapsulating the
element in an “emotion” tag, which influences the intensity, speed, and
manner in which the movement is carried out.

2. Head Gesture 

CML also defines two basic head gestures, either Deictic or Symbolic. The 
following is an example of a Symbolic gesture (further details can be found 
in [4]): 

<disagree>
  <tilt-to-right>
  </tilt-to-right>
  <tilt-to-left>
  </tilt-to-left>
</disagree>



This is an example of a complex tag composed of two base-movement
tags that form a symbolic disagree head gesture. Synchronization
and pace are achieved by specifying the corresponding parameters as
described for the “turn-to” example. The duration and
specific manner in which such a gesture is performed are determined
by the character’s defined personality properties and emotional state. As
described earlier, this can be achieved by wrapping the element within an
“emotion” tag.

3. Face Movement and Gesture 

These define the elements and behaviors for specific parts of the face,
including the Brow, Gaze, and Mouth.
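
The encapsulation of a movement element in an “emotion” tag, as described for the head movement and head gesture elements above, can be sketched with Python's standard xml.etree.ElementTree module. This is an illustrative sketch only: the helper name wrap_in_emotion is hypothetical, and writing the parameters as XML attributes is an assumption that mirrors the attribute style of the CML samples rather than the full syntax given in [4].

```python
import xml.etree.ElementTree as ET


def wrap_in_emotion(movement, emotion, intensity):
    """Return a new emotion element that encloses the given movement element."""
    wrapper = ET.Element(emotion, {"intensity": str(intensity)})
    wrapper.append(movement)
    return wrapper


# Attribute-style parameters are an assumption of this sketch.
turn = ET.Element("turn-to", {"order": "0", "priority": "0",
                              "speed": "default", "direction": "leftside"})
script = ET.tostring(wrap_in_emotion(turn, "happy", 0.3), encoding="unicode")
print(script)
```

Because the emotion is applied from outside, the same movement element could be reused unchanged under a different emotion tag, which is the property the text relies on.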

Body Animation Scripting

Character body description is a set of defined body elements which are low- 
level tags based on MPEG-4 BAPs, and a set of high-level tags representing 
body parts which are grouped from a set of respective low-level tags. CBML 
base elements are as follows: 

1. Movement (moving, bending, turning). 

2. Gesture defines body postures that include motions representing common 
Iconic, Symbolic, or Deictic body gestures. 

3. Posture (Expression) defines a set of high-level tags representing general 
body gestures. 

• “Natural” defines the character’s default or normal posture state based 
on a distinct personality. 

• “Relax” defines a relaxed posture state of the character. 

• “Tense” defines a tensed posture state of the character. 

• “Incline” defines the orientation and degree a character leans toward
or against an object.



2.6 Generating Script 

The process from script generation through to the resulting animation involves
the following components: a set of MPEG-4-compliant facial and body models;
high-level XML-based descriptions of compound facial and body features;
XML-based descriptions of user-specified personality models; behavior
definitions; a CML processor; and finally a CML decoder. The general function
of these components is delineated in Fig. 1.




Fig. 1. CML script generation - Function Abstract 

The architecture of an implementation generating and using CML is
divided into three conceptual components: the supporting models and
database for face and body animation, CML scripting, and an animation
rendering tool. The script generation component assumes state and contextual
input resulting from the underlying affective processing, planning, and do- 
main knowledge-base engines. Based on these inputs and a defined character 
personality, the CML processor then generates the consequent synchronized 
behavioral action and utterance CML script. The script is then passed on to 
the CML decoder which parses the CML and maps its elements onto view- 
specific commands for final animation rendering. 
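
The decoding step described above, in which the CML is parsed and its elements are mapped onto view-specific commands, can be illustrated with a small Python sketch. The command names and the COMMANDS table are hypothetical; a real decoder would target the commands of a particular rendering tool and would also take the enclosing emotion tags into account.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from CML element names to renderer command names.
COMMANDS = {"move-to": "WALK", "point-to": "POINT", "utterance": "SPEAK"}


def decode(cml_text):
    """Parse a CML script and return (command, argument) pairs in document order."""
    root = ET.fromstring(cml_text)
    commands = []
    for element in root.iter():
        command = COMMANDS.get(element.tag)
        if command is None:
            continue  # structural and emotion tags carry no direct command here
        # Use the object attribute if present, otherwise the element text.
        argument = element.get("object") or (element.text or "").strip()
        commands.append((command, argument))
    return commands


SAMPLE = """
<cml>
  <character name="James">
    <happy intensity="0.3">
      <move-to object="product1" speed="default" />
      <point-to object="product1" />
      <utterance>Take a look.</utterance>
    </happy>
  </character>
</cml>
"""
decoded = decode(SAMPLE)
```

Keeping the mapping in a table rather than in code reflects the decoupling the architecture aims for: targeting another renderer would, under these assumptions, only require a different table.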



3 Avatar Markup Language 

AML was developed in the context of the IST project SoNG in collaboration
with IIS, Miralab, and LIG. The objective of the project was to design and de- 
velop a full end-to-end MPEG-4 multimedia framework to support, amongst 
other features, 3D avatar-based multi-user chat rooms and autonomous syn- 
thetic characters. The first of these was to be facilitated via the development of 
an interface tool that allowed users to define animation sequences by selecting 
and merging predefined and proprietary animation units. Likewise, synthetic 
characters were to be controlled in a similar manner to fill roles such as sales 
assistants in virtual shops. The focus was on providing the tools and infras- 
tructure necessary to anybody who would like to develop such applications. 
Hence, a common mechanism was needed to allow both human users and au- 
tonomous agent-based systems to define full face and body avatar animation. 
However, it was important to allow future users or developers to animate their 
avatars using non-procedural commands whilst trying not to limit their cre- 
ativity by imposing predefined facial expressions or gestures on them. Also, we 
were aiming at providing a means of generating externally observable behav- 
ior and not on specifying a mapping between internal reasoning and behavior, 
as is the case with many other scripting systems. 

The design of such a mechanism saw the animation process conceptually 
divided into three components. First, a database of basic facial and body ani- 
mation units, which could be extended or modified by a third party interested 
in generating avatar animations. These animations can be either specified by 
hand or achieved via motion capture. Examples include smiling or waving. 
Second, a rendering system capable of merging multiple face and body ani- 
mation units and text to speech input in realtime. Finally, a high-level script- 
ing language designed to allow animators - both human and non-human - 
to specify which animations to use together with timing, priority, and decay 
information. The resulting scripting language - AML - is the only one of the 
three components that we specify. 

AML facilitates multi-modal interaction based on embodied characters by 
allowing users or agents to trigger appropriate face animation, body animation,
and TTS modules in a time-synchronized and easy manner. This may
involve mixing of multiple gestures and expressions into a single animation. 
Originally, no basic animations were considered compulsory within AML, but 
it became obvious that some parameterized behaviors would have to be pro- 
vided. Examples of such behaviors include pointing, facing, and walking. Each 
of the behaviors is generated by the implemented rendering system by calcu- 
lating the movement of the avatar as a function of its initial position and the 
target coordinates supplied by the animator. 

3.1 AML Specification 

Having given a brief overview of the requirements and purpose of AML we 
will now have a look at the syntax of the language. AML is an XML-based 
scripting language. Figures 2 to 4 give an outline of the AML syntax. We will 
not describe the syntax in too much detail here. Interested readers are advised 
to refer to [9]. 

Each individual AML script is encapsulated by the AML root node and 
consists of either a Facial Animation (FA) node or a Body Animation (BA) 
node or both. FA nodes may contain a combination of TTS nodes and Avatar 
Face Markup Language (AFML) nodes, the syntax of which is illustrated in 
Fig. 2. 

<AFML>
  <Settings>
  </Settings>
  <ExpressionsTrack name="Track name">
    <Expression>
      <StartTime>mm:ss:mmm</StartTime>
      <ExName>"name"</ExName>
      <Envelope>
      </Envelope>
    </Expression>
  </ExpressionsTrack>
</AFML>



Fig. 2. AFML syntax 



Here we highlight the flexibility that is given to an animator to define as 
many Expression Tracks as required, each containing as many Expressions 
as required. Expressions are stored in a database of facial animation units. 
For example, in an MPEG-4-based system, such units would represent animations
as values for MPEG-4 FAPs for an arbitrary number of frames. A start
time and an envelope specifying decay, duration, and intensity accompany 
each one. In addition, Speech Tracks may be specified when a TTS engine is 
not available or suitable. Similarly, BA nodes contain Avatar Body Markup 
Language (ABML) nodes. The syntax can be seen in Figs. 3 and 4. 



<BodyAnimationTrack name="char, name of track">
  <UserAnimation type="bap | trk | wrl" filename="char">
    <StartTime>mm:ss:mmm | autosynch | autoafter</StartTime>
    <Speed>normal | slow | fast</Speed>
    <Intensity>float, 0 to n</Intensity>
    <Priority>integer, 0 to n</Priority>
  </UserAnimation>
  <Emotion>...</Emotion>
  <StandardGesture>...</StandardGesture>
</BodyAnimationTrack>



Fig. 3. ABML’s BodyAnimationTrack syntax 



Figure 3 shows the basic animation capabilities of ABML. As with AFML,
ABML comprises a Settings node and allows us to define one or more
BodyAnimationTrack nodes. Animation can be adapted by means of modifiers
such as Speed, Intensity, and Priority. StartTimes are used for synchronization
and can take the form of an absolute time or a relative indicator of the start
time (autosynch or autoafter) with respect to other animations within a track.
We focus our attention on some of the subnodes of BodyAnimationTrack. As with
AFML, body animation units can be retrieved from a database using the
UserAnimation node. In addition, predefined emotional indicators can be
selected as either a gesture or a posture, and standard gestures are provided
to support standard interaction.
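
To make the StartTime options concrete, the following Python sketch resolves absolute and relative start times within a single track. It is an illustration under stated assumptions, not the behavior of the actual AML renderer: the duration of each animation is assumed to be known in advance, “autoafter” is taken to mean “when the previous animation in the track ends”, and “autosynch” is assumed to mean “together with the previous animation”.

```python
def parse_time(text):
    """Convert an 'mm:ss:mmm' time stamp to milliseconds."""
    minutes, seconds, millis = (int(part) for part in text.split(":"))
    return (minutes * 60 + seconds) * 1000 + millis


def resolve_starts(animations):
    """Resolve start times (ms) for the (start, duration_ms) pairs of one track."""
    resolved = []
    prev_start, prev_end = 0, 0
    for start, duration in animations:
        if start == "autoafter":
            begin = prev_end      # start once the previous animation has ended
        elif start == "autosynch":
            begin = prev_start    # assumed: start together with the previous one
        else:
            begin = parse_time(start)
        resolved.append(begin)
        prev_start, prev_end = begin, begin + duration
    return resolved


# Hypothetical track: an absolute start followed by two relative ones.
track = [("00:00:000", 3000), ("autoafter", 1200), ("autosynch", 500)]
starts = resolve_starts(track)
```

With these assumptions the second animation begins at 3000 ms, when the first one ends, and the third begins together with the second.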

In Fig. 4 we draw attention to a set of parameterized behavior nodes, 
namely FacingAction, PointingAction, WalkingAction, and ResettingAction. 
Each behavior node specifies modifiers such as StartTime, Speed, and Priority. 
A subset also specifies target coordinates for the behavior. For walking, a 
number of control points can be specified to define the route taken by an avatar 
in 3D space. Notice that only the X and Z coordinates are used indicating that 
only movement along a horizontal plane is permitted. 
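
Because a walking route is given as control points in the horizontal (X, Z) plane, simple quantities such as its length follow directly from the coordinates. The Python sketch below computes the length of a route and the time needed to walk it; the constant walking speed of 1.4 meters per second is an assumed value for illustration, and the renderer's actual path interpolation is not specified here.

```python
import math


def path_length(control_points):
    """Total length in meters of a route through (x, z) control points."""
    return sum(math.dist(a, b)
               for a, b in zip(control_points, control_points[1:]))


def walk_duration(control_points, speed_m_per_s=1.4):
    """Time in seconds to walk the route at a constant, assumed speed."""
    return path_length(control_points) / speed_m_per_s


# Hypothetical route: two control points 10 meters apart along the X axis.
route = [(-5.0, 0.0), (5.0, 0.0)]
length = path_length(route)
```

A duration obtained this way could, for instance, be used to estimate when an action marked “autoafter” would begin.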

<BodyAnimationTrack name="char, name of track">
  <FacingAction bodypart="body | headonly">
    <XCoor>float, target's X coordinate in meters</XCoor>
    <YCoor>float, target's Y coordinate in meters</YCoor>
    <ZCoor>float, target's Z coordinate in meters</ZCoor>
  </FacingAction>
  <PointingAction handconfig="onefinger | open"> ...
  </PointingAction>
  <WalkingAction mode="default | run"> ... </WalkingAction>
  <ResettingAction> ... </ResettingAction>
</BodyAnimationTrack>

Fig. 4. ABML’s BodyAnimationTrack behaviors

AML scripts offer a number of advantages to animators - human or non-human.
First, they give explicit control over the mutual synchronization of
facial expressions, gestures, and speech by allowing start times and durations
to be specified for each. This means animators are free to have even partial
overlap of animation tracks starting before, together with, or after any other
track. Second, as well as providing a basic set of animations, AML allows
developers to provide their own through the ExpressionFiles and UserAnimation
nodes.



4 CML and AML Applied 

In this section we present an animation scenario for a character (Sales Agent) 
moving toward an object and pointing at it. The aim is to demonstrate how 
a single scenario is scripted using both CML and AML, drawing out elements 
of believability attributes, gestures, and animation functionality, as well as 
issues of the overlaps between both languages in terms of functionality while 
differing in terms of tag-use granularity. 

4.1 CML 

In the examples below we demonstrate the use of simple CML high-level
tags to script a walk animation from one position toward a target object. The
extracts show CML’s base movement specifications. Using high-level tags
keeps scripts simple and short, but limits flexibility over final animation
control. Alternatively, higher flexibility can be achieved by using low-level,
MPEG-4-specific tags, as demonstrated in AML.

The associated figures below further show that dramatic differences are 
achieved in the way a single activity is animated by varying its believability 
attributes of emotion intensity and speed. The animations are governed by a 
defined mental state so that each gesture and behavior in which a movement 
is made is inherited from the state of emotion specified. This will affect the 
speed, the height of footsteps, hand movements and gestures, and overall 
behavior. 

Fig. 5. Happy move and point animation



Sample CML - Happy Move and Point Script 

<cml>
  <character name="James" personality="extravert" role="psa"
             gender="m" base-animation-file="butler.liv">
    <happy intensity="0.3" decay="0.5" target="goal"
           priority="1">
      <move-to order="0" priority="0" speed="default"
               object="product1" />
      <sync type="par" order="1" priority="0">
        <point-to object="product1" />
        <utterance>
          "I've found just what you wanted! Take a look."
        </utterance>
      </sync>
    </happy>
  </character>
</cml>



Figure 5 shows a happy James walking with swift eagerness toward his
target. In the next animation sequence (see Fig. 6), James is slow, less
eager when pointing, and expresses an overall sad behavior. The
reader may note that while there are distinct and recognizable behavioral
differences between the two animation sequences, there is little variation in
the actual script, which suggests that CML is well suited to automated scripting.

Fig. 6. Sad move and point animation

Sample CML - Sad Move and Point Script

<cml>
  <character name="James" personality="extravert" role="psa"
             gender="m" base-animation-file="butler.liv">
    <sad intensity="0.1" decay="0.3" target="goal" priority="1">
      <move-to order="0" priority="0" speed="slow"
               object="product1" />
      <sync type="par" order="1" priority="0">
        <point-to object="product1" />
        <utterance>
          "I'm afraid we’ve only got 2 bottles in stock!"
        </utterance>
      </sync>
    </sad>
  </character>
</cml>
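
Since the happy and sad scripts differ only in the emotion tag, its attributes, the movement speed, and the utterance, a script of this shape lends itself to generation from a handful of parameters. The following Python sketch illustrates the idea with the standard xml.etree.ElementTree module. The function move_and_point is hypothetical and is not the CML processor itself; it also omits the base-animation-file attribute and shortens the utterances.

```python
import xml.etree.ElementTree as ET


def move_and_point(emotion, intensity, decay, speed, product, text):
    """Build a move-and-point CML script wrapped in the given emotion tag."""
    cml = ET.Element("cml")
    character = ET.SubElement(cml, "character", {
        "name": "James", "personality": "extravert",
        "role": "psa", "gender": "m"})
    mood = ET.SubElement(character, emotion, {
        "intensity": str(intensity), "decay": str(decay),
        "target": "goal", "priority": "1"})
    ET.SubElement(mood, "move-to", {
        "order": "0", "priority": "0", "speed": speed, "object": product})
    sync = ET.SubElement(mood, "sync", {
        "type": "par", "order": "1", "priority": "0"})
    ET.SubElement(sync, "point-to", {"object": product})
    ET.SubElement(sync, "utterance").text = text
    return cml


happy = move_and_point("happy", 0.3, 0.5, "default", "product1", "Take a look.")
sad = move_and_point("sad", 0.1, 0.3, "slow", "product1",
                     "Only 2 bottles in stock!")
print(ET.tostring(sad, encoding="unicode"))
```

In such a setup the emotion name, intensity, and speed would come from the affective and planning engines, while the structure of the script stays fixed.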



4.2 AML 

The following example shows the use of AML to animate a Sales Agent in a 
3D telephone shop (see Fig. 7). The scenario is similar to the one described 
above with a Sales Agent walking toward and pointing to a product. Clearly, 
as compared to CML, the focus here is much less on believability aspects 
of character animation and more on scripting and controlling actions and 
behaviors in a 3D world in a synchronized manner. 

Sample AML - Walk and Point 

<AML face_id="x" body_id="y" root_path="c:\" name="Point to phone">
  <FA start_time="00:00:000">
    <TTS output_fap="speech.fap" output_wav="speech.wav">
      <Text>"Let me show you another phone over here."</Text>
    </TTS>
    <AFML>
      <Settings>
        <Fps>25</Fps>
        <Duration>00:06:000</Duration>
        <FAPDBPath>.\Expressions\</FAPDBPath>
        <SpeechPath>.\Speech\</SpeechPath>
      </Settings>
      <ExpressionsTrack name="smile">
        <Expression>
          <StartTime>00:00:800</StartTime>
          <ExName>smile.ex</ExName>
          <Envelope>
            <Point>
              <Shape>log</Shape>
              <Duration>00:00:500</Duration>
              <Int>1</Int>
            </Point>
            <Point>...</Point>
            <Point>...</Point>
          </Envelope>
        </Expression>
      </ExpressionsTrack>
      <SpeechTrack>
        <StartTime>00:01:300</StartTime>
        <FileName>speech.fap</FileName>
        <AudioFile>speech.wav</AudioFile>
      </SpeechTrack>
    </AFML>
  </FA>
  <BA start_time="00:04:000">
    <ABML>
      <Settings>
        <Fps>25</Fps>
        <BAPLibPath>.\BapFiles</BAPLibPath>
      </Settings>
      <BodyAnimationTrack name="Walk">
        <WalkingAction mode="default">
          <StartTime>00:00:000</StartTime>
          <Style type="bap">"walk.bap"</Style>
          <ControlPoint>
            <XCoor>-5</XCoor>
            <ZCoor>0</ZCoor>
          </ControlPoint>
          <ControlPoint>
            <XCoor>5</XCoor>
            <ZCoor>0</ZCoor>
          </ControlPoint>
        </WalkingAction>
        <PointingAction handconfig="open">
          <StartTime>autoafter</StartTime>
          <XCoor>2</XCoor>
          <YCoor>1.6</YCoor>
          <ZCoor>4</ZCoor>
          <Speed>fast</Speed>
          <Priority>10</Priority>
        </PointingAction>
      </BodyAnimationTrack>
    </ABML>
  </BA>
</AML>

Fig. 7. A Sales Agent pointing at a product

The above script instructs a virtual sales assistant to smile and say “Let
me show you another phone over here”, after which the character walks toward
a phone and points to it. Notice the start times of <FA> and <BA>.
<BA> is delayed by 4 seconds to let the character start talking before the
walking starts. Start times of ExpressionTracks, SpeechTracks, and
BodyAnimationTracks are relative to the start times of <FA> and <BA>. The use
of the relative synchronization indicator “autoafter” is also illustrated. This
guarantees that the pointing action will not start before the walking action
has finished. Another important feature is the use of envelopes in <AFML>
to “shape” the animation. These can be either linear, exponential, or
logarithmic. Finally, there is the different syntax for the different body
actions. In particular, the walking action defines control points to specify
the trajectory the character should take while walking. In addition, AML allows
user-defined animations to be used to extend the basic animation set. The
following example shows the use of UserAnimation to make the Sales Agent
look impatiently at its watch (see Fig. 8).



Sample AML - Look at Watch 



<AML face_id="x" body_id="y" root_path="c:\" name="Look at watch">
  <BA start_time="00:00:000">
    <ABML>
      <Settings>
        <Fps>25</Fps>
        <BAPLibPath>.\BapFiles</BAPLibPath>
      </Settings>
      <BodyAnimationTrack name="Impatient">
        <UserAnimation type="bap" filename="impatient.bap">
          <StartTime>00:00:000</StartTime>
          <Speed>fast</Speed>
          <Intensity>1</Intensity>
          <Priority>1</Priority>
        </UserAnimation>
      </BodyAnimationTrack>
    </ABML>
  </BA>
</AML>



5 Discussion and Lessons Learned 

The chapter presented an account of two approaches to specifying scripting
languages for character animation which are currently being developed and
evaluated at Imperial College London, CML and AML. Both are attempts
to specify and develop mechanisms for the dynamic scripting of
expressive behavior in embodied agents. The growing popularity of embodied
agents will increase the demand for such languages in order to automate the
animation process and allow non-technical character creators to quickly script
believable animated behavior. Such languages will benefit applications that
require real-time generation of animated behavior and control, such as virtual
tutors and trainers, virtual presenters, conversational agents, virtual game
characters, electronic personal assistants, and many more applications.

Fig. 8. An impatient Sales Agent

In general, the role of these languages is to mediate and control language
semantics between human and machine. Specifically, in the case of the
languages described in this chapter, the role is to script the animation of agent
behaviors (external) and convey behavioral information between communicating
agents (internal). The language design is based primarily on operational
semantics - where specified tags correspond to an operational rendering
function or set of functions that complete an animation - and partially on content
semantics - where tags hold some behavioral or believability attribute, e.g. an
emotion, a mood, a personality trait, etc. There is also a third dimension of
semantics as identified by Piez [14], which is structural semantics - where a tag
is the arbitrary relation between a signifier and a signified, and, according to
Piez’s understanding, the layer in which a tag is defined in realtime may help
to identify a contextual view of interpreting the tag. Markup languages are,
nonetheless, semantic-less languages; however, interpreting a meaning from
the defined tags will depend on where, when, and how tags are processed.

Since there are a number of these languages emerging having both common 
and similar objectives, though taking different approaches to specify them, 
there is a need to compare and identify their key features, aiming at defining 
salient characteristics and requirements as a basis for setting the platform for 
possibly specifying a unified language. To approach this issue we suggest a set 
of factors that need to be present in a language for it to be successful. 



5.1 A Comparison 

In order to identify salient features for a unified language, we carried out a
comparison of the capabilities of existing scripting languages. Through this
analysis we propose a set of
comparison metrics as defined in the table in Fig. 9. Details of the specific 
languages included can be found in the other chapters of this book. 

Fig. 9. Comparison of features of various scripting languages

The comparison is based on five metrics that describe each language in
terms of Control: defining the degree of decoupling between language and 
animation tools; Granularity: the detail of the taxonomy used for defining 
kinetic and behavioral animation; Flexibility: support for user-defined anima- 
tions and varying levels of control; Classification: categorization of the type 
of animation; Believability: the use of attributes of emotion and personality 
in order to make character animation more believable. 

To varying degrees all the languages support parameterized action and 
synchronization. APML [13], CML [1], MPML [19], and TVML [17] are built 
on the existing standard SMIL but also introduce additional parameters for 
higher control. 

All the languages appear to be complete in that they achieve the objectives 
they set out to meet. However, each language addresses a different issue: 
animation, character, and believability attribute representation, dialogue acts, 
and presentation. 

It was difficult to assess the extent of usability, consistency, and extensibil- 
ity of the languages surveyed due to a lack of open availability of full language 
specification and the associated tools and players for creating and visualizing 
affected behavior. Most of the languages are still on the design bench and 
under development. 

We propose a set of comparison metrics/parameters as defined in the table 
in Fig. 9. We start by defining the objectives of the language as mentioned 
above. Then we look at the format. XML was a common choice for most 
language specifications. Besides XML’s increasing popularity in application
development and Internet support, it provides for extensibility and syntactic
correctness through XML parsers and validation.
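
As a small illustration of the syntactic checking that an XML format makes available, the Python sketch below uses the standard xml.etree.ElementTree parser to test whether a script fragment is well formed. It checks well-formedness only; validation against a DTD or schema for a particular language would require additional tooling and is not shown. The two fragments are hypothetical examples in the style of an AFML envelope.

```python
import xml.etree.ElementTree as ET


def is_well_formed(script):
    """Return True if the script parses as well-formed XML."""
    try:
        ET.fromstring(script)
    except ET.ParseError:
        return False
    return True


# The second fragment repeats an opening tag where a closing tag is needed.
good = "<Envelope><Point><Duration>00:00:500</Duration></Point></Envelope>"
bad = "<Envelope><Point><Duration>00:00:500<Duration></Point></Envelope>"
print(is_well_formed(good), is_well_formed(bad))
```

A check of this kind catches slips such as an unclosed tag before a script ever reaches the animation player.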

We then analyze the elements supported by each language. Most support 
animation and behavior definition. This is sometimes supplemented by ex- 
plicit voice control (most notably VHML) or definitions of the world, and 
objects within, that surround the agent. Here we interpret world information 
as either the elements of a presentation, as in the case of MPML, or full scene 
descriptions as for TVML. Beyond animation, some languages have been de- 
signed to support descriptions of characters as well. CML, for example, has 
been designed to be both a representation and scripting language. It targets an 
abstract annotation of character attributes which can be used in the internal 
reasoning process as well as a lower level of annotation for animation script 
generation. Finally, other languages are designed to represent dialogues. RRL 
[15] and APML are examples of these in that both languages are based on the 
definition of the communicative functions and relate these to the expressions 
effected. 

Related to animation we can compare markup languages on the level of 
control they provide to the animation process. As mentioned earlier most 
languages support synchronization and merging of animation tracks. However, 
some additional animation control is proposed. AML [9] provides a means of 
inhibiting body parts during an animation track so that they can be used 
by another track. This feature is useful in avoiding conflict while merging
animations. STEP [5, 6], like PAR [2] and VHML [18], introduces the idea 
of providing feedback to or awareness of the calling application. This feature 
can help the synchronization process. 

All languages aim at high-level abstraction and domain independence; 
however, MPML in particular models its language constructs on and is there- 
fore tightly coupled with the MSAgent technology. On the other hand, APML, 
CML, and VHML define their low-level tags based on MPEG-4 FAPs and 
BAPs. This introduces the issues of granularity of representation. We define 
Micro Elements such as FAPs and BAPs or “turn” and “move” in the case of 
STEP which are very low level and from which a significant set of animation 
tags/commands (Macro Elements) can be constructed. We also define an ex- 
tensibility metric. Most languages claim to be extensible, but in varying ways. 
In many cases, such as CML and VHML, it is a matter of constructing com- 
plex elements from existing macro elements. In the case of AML, MURML 
[8], and TVML it is a case of inserting new animations into the script or 
animation library. 
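
The idea of constructing complex elements from existing ones can be sketched as a simple expansion step in Python. The MACROS table and the expand function are hypothetical and serve only to illustrate this style of extensibility; they are not taken from any of the languages compared. The example reuses the disagree gesture, composed of two tilt movements, from the CML section.

```python
import xml.etree.ElementTree as ET

# Hypothetical macro table: each complex tag expands to a sequence of base tags.
MACROS = {"disagree": ["tilt-to-right", "tilt-to-left"]}


def expand(element):
    """Return a copy of the tree in which empty macro tags are filled in."""
    copy = ET.Element(element.tag, dict(element.attrib))
    copy.text = element.text
    copy.tail = element.tail
    children = list(element)
    if not children and element.tag in MACROS:
        # An empty macro tag is replaced by its defined base movements.
        for name in MACROS[element.tag]:
            ET.SubElement(copy, name)
    for child in children:
        copy.append(expand(child))
    return copy


script = ET.fromstring("<happy><disagree /></happy>")
expanded = expand(script)
print(ET.tostring(expanded, encoding="unicode"))
```

Under this scheme, adding a new complex gesture amounts to adding one entry to the table, while the set of base elements understood by the renderer stays unchanged.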

Including believability attributes of emotion and personality is essential to 
achieve convincing and realistic behavior. Here we specify emotion and person- 
ality as attributes of interest. Apart from MURML and STEP, all languages 
readily support one or both of these attributes with similar specifications. 
Noticeably there is a strong correlation between those languages that support 
character definition and/or dialogue acts and their support for believability 
attributes. 

Finally we found some divergence on the type of character and its con- 
stituent body parts that can be supported by the various languages. Most 
languages are restricted to human-like characters. This can include characters 
as long as they present human physical characteristics. In contrast, VHML, 
like CML and PAR, claims not to be restricted to the description of humanoid 
characters, but can support, for example, four-legged creatures. 

5.2 Towards a Unified Language 

As a number of such scripting languages now exist, there appears to be a need 
for the research community to look at and agree upon the requirements of and 
expectations from them. Here, we delineate some key objectives and general 
language specification requirements we deem necessary for such languages 
to meet their objectives. Based on the comparison in the previous section, 
we suggest specific language constructs that define the operational semantics 
needed for embodied agent scripting languages. 

Objectives of a Unified Language 

• Define a framework to decouple embodied agent animation tools and the
underlying affect and planning engines.

• Establish a formal specification for unified/consistent interpretation.

• Provide for modular development. 

• Create a markup language based on XML that allows users/agents to 
provide semantic and scripting annotations for handling and animating 
embodied agents. 

Language Requirements 

The design and formalization process of any language needs to fulfill a set of 
salient considerations defined so as to meet the criteria for general use and 
implementation, in addition to the set of criteria that fulfills the functional 
purpose of the language. 

The following key design criteria are identified for the development of such 
languages: 

• High-Level: Abstracted from the low-level technology elements yet retain- 
ing tags for low-level elements to allow detailed, flexible control. 

• Usability (Machine/Human Legible): The language should be usable and
easily implementable with multi-purpose applications and technologies.

• Extensibility: Provision for user-defined tags and complex elements, sharing,
and reuse.

• Parameterized Action Support: Provision for customized and dynamically 
generated scripts. 

• Synchronization Support: Modality control, merged animation. 

• Consistency: Provision for predictable control of the animation output 
regardless of the implementation and the platforms it will be run on. 

• Domain Independence: Not catering for any one domain, implementation 
application, or animation rendering tool. 

6 Conclusion 

In this chapter we described both the CML and AML languages. We also ad- 
dressed the current fragmentation in the field of embodied character animation 
and representation. There seems, however, to be an overriding agreement that 
efforts could be made toward unifying many of the approaches to character 
animation that have emerged in recent years. Following an analysis of generic 
language requirements we have looked at the main features of a representative 
set of current languages. Although similar in many ways, the chosen languages 
were developed following different approaches and thus present a variety of 
functions and capabilities. 

Resulting from this review we propose metrics for language comparison. 
These can be broadly categorized as format, specification elements, supported 
character types and modules, believability attributes, animation control, gran- 
ularity, and extensibility. Finally, we put forward some suggestions about the 
possible requirements for a unified language. 
There are, of course, many open and more specific issues that still need 
answering. For example, what taxonomy for the affective and motion elements 
should be used? What granularity of description should be targeted? 
Indeed, there is a trade-off to be made between higher levels of control (high 
granularity) and the complexity of the resulting language. Often these decisions 
have been influenced by the underlying technology used (e.g. MPEG-4 
or MS Agent). Similarly, what affective and personality theories should be 
adopted to define the tags for affective expression? A decision has not yet been 
made within the research community, but striving toward a unified scripting 
and representation language may be the catalyst for much needed agreement. 

Summary. Developing an embodied conversational agent able to exhibit human-like 
behavior while communicating with other virtual or human agents requires 
enriching the dialogue of the agent with non-verbal information. Our agent, Greta, 
is defined as two components: a Mind and a Body. Her mind reflects her person- 
ality, her social intelligence, as well as her emotional reaction to events occurring 
in the environment. Her body corresponds to her physical appearance able to dis- 
play expressive behaviors. We designed a Mind-Body interface that takes as input 
a specification of a discourse plan in an XML language (DPML) and enriches this 
plan with the communicative meanings that have to be attached to it, by producing 
an input to the Body in a new XML language (APML). Moreover we have devel- 
oped a language to describe facial expressions. It combines basic facial expressions 
with operators to create complex facial expressions. The purpose of this chapter 
is to describe these languages and to illustrate our approach to the generation of 
behavior of an agent able to act consistently with her goals and with the context of 
the interaction. 



1 Introduction 

Humans communicate using verbal and non-verbal signals: body posture, ges- 
tures (pointing at something, describing object dimensions, etc.), facial expres- 
sions, gaze (making eye contact, looking down or up to a particular object), 
and using intonation and prosody, in combination with words and sentences. 
The way in which people communicate, and therefore the signals that they 
employ, is influenced by their personality, goals, and affective state and by the 
context in which the conversation takes place [12]. One very active research 
area in the field of intelligent agents is devoted to constructing Embodied 
Conversational Agents (ECAs). An ECA is an agent embedded in a virtual 
body that interacts with another agent (a human user or another virtual 
agent) in a human-like manner, and particularly in a believable way. Believ- 
ability is mostly related to the ability to express emotion [3] and to exhibit 
a given personality [20]. However, according to recent literature [36, 13], an 
agent is more believable if it can behave in ways typical of given cultures, 
and, finally, if it has a personal communicative style [5, 34]. Developing such 
a “computer conversationalist” that is able to exhibit these added dimensions 
of communication requires moving from natural language generation (NLG) to 
multi-modal behavior generation. One possible approach is to consider body 
and mind as strictly and necessarily interdependent. Another is to see them 
as mainly independent from each other. The first approach implies that the 
very planning of the meanings to convey is conceived by taking into account 
the possible signals. The second approach views an ECA as an entity 
constituted by a “mind” and a “body”. At the mind level, only the meaning of a 
communicative action is represented, leaving to the body the task of decid- 
ing which signal to employ. In this case, in order to avoid signal redundancy 
or conflicts, it is necessary to write body-dependent rules at an intermediate 
sentence planning level. Using such rules has at least two advantages: first you 
can adapt different bodies to the same mind (say, the expertise and conver- 
sational capacity of a doctor can be conveyed by a beautiful girl or by an old 
white-haired man); second, the same body may take on different behaviors, 
determined by individual differences, due to culture, communication style, 
personality [25], in expressing the same meaning. To construct the 
architecture of our ECA, in the context of the EU project MagiCster 5, we adopt this 
second approach. The ‘Mind’ and ‘Body’ are interfaced by a language based 
on XML, so as to overcome integration problems and to allow their inde- 
pendence and modularity. During the conversation, the agent’s Mind decides 
what to communicate considering different factors that trigger the goal of 
communicating and influence the contents to communicate (cognitive state, 
emotions, context, user sensitivity, and so on). At each given moment of a 
communicative interaction, all of these aspects combine with each other to 
determine what the agent will say [11]. The Body “reads” what the Mind 
decides to communicate and interprets and renders it at the surface level, 
according to the available communicative channels. This step, too, may be 
influenced by several factors (personality, style, social identity, culture) that 
determine which combination of verbal and non-verbal signals is best suited 
to express a particular communicative goal. 

5 IST Project IST-1999-29078, partners: University of Edinburgh, Division of 
Informatics; DFKI, Intelligent User Interfaces Department; SICS; University of Bari, 
Dipartimento di Informatica; University of Rome, Dipartimento di Informatica e 
Sistemistica; AvatarMe. 

To achieve a rich expressiveness, the output of the agent’s Mind cannot be 
just a combination of symbolic descriptions of communicative acts. It should 
include, as well, a specification of the “meanings” that the Body will have to 
attach to each of them. These meanings include the communicative functions 
that are typically used in human-human dialogues: topic-comment, affective, 
meta-cognitive, performative, deictic, adjectival, and belief relation functions 
[31]. 

To achieve this granularity in expressing believable behaviors, we define 
a set of languages for specifying the format of dialogue moves at different 
abstraction levels. In particular, to specify the format of the dialogue move 
that should act as an interface between the agent’s Mind and its Body, we 
designed a Mind-Body interface. This interface takes as input a specification 
of a discourse plan in an XML language (Discourse Plan Markup Language, 
DPML) in which only communicative goals and relations between these goals 
are specified and generates as output a formalization of the agent behavior in a 
new XML language called APML (Affective Presentation Markup Language) 
able to express the content of the dialogue move at the meaning level. In this 
way, the task to interpret how to render each meaning, or a combination of 
meanings, at the surface level, can be left to body-specific wrappers. At the 
signal level, we have developed a language to describe facial expressions. Sig- 
nals may be described recursively (a signal may be defined by the combination 
of already defined signals) or by specifying all the parameters (facial actions). 
These descriptions are interpreted by the facial player that produces the 
animation of the agent. 

In this chapter, after describing the main features of the underlying ar- 
chitecture, we present the APML language and how it has been used, in the 
context of the MagiCster project. In particular, we will show how it has been 
interfaced with a 3D realistic face called Greta [24] and a synthetic voice [4]. 
To illustrate the approach, we will use an example in the medical domain. 
Conclusions will be discussed in the last section. 



2 Expressing Believable Behaviors 

How does communication originate? Communication is a means to influence 
others [8, 7]. A system engages in an act of communication as it has the 
goal to influence another system; that is, to cause another system to have 
some goal it does not have, or to refrain from a goal it has. There are many 
different ways to influence others: strength, seduction, aggression, being an 
example to imitate, inducing emotions. The peculiar way in which a system 
influences another system through communication is through providing beliefs 
to it. Any time a system S (Sender) communicates something to a system A 
(Addressee), S provides A with beliefs about S’s goals, and through this it asks 
A to “adopt” S’s goal; that is, to pursue it as if it were a goal of A’s. Any time 
we communicate, we provide others with beliefs about our goals in order to 
have them pursue our own goal. For example, as S tells A “Take an Aspirin”, S 
provides A with a belief about S’s goal (the goal that A does some action, and 
specifically the action of taking an Aspirin), in order to have A pursue S’s goal; 
that is, to have A take the Aspirin. All communicative acts - speech acts, like 
a sentence or a discourse, but also non-verbal acts, like a gesture, a gaze, or a 
facial expression - provide beliefs about the Sender’s goal; that is, about what 
action the Sender wants the Addressee to take: so any act of communication 
is a way to ask others for some action, but we may ask them to do different 
types of actions. With a request, like “Take an Aspirin”, S asks A to do some 
action; with a question, like “Did you take an Aspirin?”, S asks A to do a 
particular action, the action of providing S with information; with a piece of 
information, like “Aspirin relieves the pain”, S is still asking A to do some action: the 
cognitive action of believing what S is saying. Since any act of communication 
provides the Addressee with a complex belief about the Sender’s goal, and the 
Sender’s goal is that the Addressee do some specific action - doing, providing 
information, believing - the act of communication must also mention all the 
beliefs that specify the Sender’s goal. First, S’s goal must be specified into a 
performative [2, 30], that is a specific goal that claims some particular social 
relationship between Sender and Addressee (“Take an Aspirin” may be an 
order if A is my child, advice if he is a friend of mine); second, the act of 
communication must also specify the particular action or information 
requested, or the beliefs S wants A to hold. Therefore, the minimal 
unit of communication is a communicative act that is made up of two parts, 
two packages of beliefs: a performative and a propositional content. Any time 
we communicate we must have conceived of at least these two packages of 
beliefs, and to the extent to which these beliefs are beliefs we have the goal 
to communicate to an Addressee, we can say they form a meaning. 

A meaning can be viewed, then, as a set of beliefs that a system has the 
goal to transmit to another system; that is, beliefs that S wants A to hold 
as well. And of course, the meanings S may have the goal to transmit to A 
may even be very complex meanings, so they may need to be packaged into 
several different communicative acts that make up a complex communicative 
act. This is what happens when a sentence is not enough but we need to resort 
to a whole discourse, a novel, a handbook, a theater performance, a film, to 
specify all the meanings we mean. But since beliefs are simple information, not 
physical patterns of matter or energy, how can S cause these beliefs (meanings) 
to pass from S’s mind also to A’s mind? To do so, the immaterial meanings 
must be linked to perceivable stimuli that we call communicative signals. 
Each meaning or set of meanings must be linked to a particular signal or set 
of signals, and both S and A must share a common system of communication; 
that is, a system that states how meanings and signals correspond to each 
other. Therefore, any time a system S has to communicate a set of meanings 
to another system A, it has to find out, in the communication system it 
assumes to be shared with system A, the specific signals that correspond to 
the meanings to convey. 
Let us first review the beliefs that may form the content of a communicative 
act. Three classes of meanings can be distinguished [29]: 

Information about the World. As we communicate, we provide information 
about concrete or abstract events, their actors and objects, and the 
time and space relations among them. Such information is provided mainly 
through words, but also by gestures or gaze. 

Information about the Speaker’s Identity. Physiognomic traits of our 
face, eyes, lips, the acoustic features of our voice, and often our posture 
provide information about our sex, age, socio-cultural roots, and 
personality. And, of course, our words can inform the addressee of our 
Self-Presentation: that is, the way we want to present ourselves. 

Information about the Speaker’s Mind. While mentioning events of the 
external world, we also communicate why we want to talk of those events, 
what we think and feel about them, how we plan to talk about them. We 
provide information about the beliefs we are mentioning, our own goals 
concerning how to talk about them, and the emotions we feel while talking 
[29]. 

Here we focus on some of them, which are implemented in our ECA, Greta. 
Information about the world includes: 

1. Deictics: To mention the referents of our discourse, we may point at them 
by deictic gestures or gaze. 

2. Adjectival: To refer to some properties of objects we may use iconic or 
symbolic gestures and even gaze (as when we narrow our eyes to mean 
“small” or “difficult”). 

Within information about the Speaker’s Mind, and particularly informa- 
tion about the Speaker’s beliefs, we inform about: 

1. degree of certainty: words like “perhaps”, “certainly”; conditional or sub- 
junctive verb modes; but also frowning, which means “I am serious in 
stating this”; opening hands, which means “This is self-evident”; 

2. metacognitive information: that is, the source of mentioned beliefs, whether 
they come from memory, inference, or communication (we look up when 
trying to make inferences, snap fingers while trying to remember, etc.). 

Considering goals we inform about: 

1. performativity of the sentence (by performative verbs, intonation, facial 
expression); 

2. topic-comment or theme-rheme distinction (by batons, eyebrow raising, 
voice intensity, or pitch contour); 

3. rhetorical relations: class-example (saying first, second, third, and so on; 
counting on fingers), topic shift (expressed through posture shift); 

4. turn-taking and backchannel: raise hand for asking turn; nod to tell the 
Interlocutor we are following what he or she says. 

Finally, we inform our addressee about the emotions we feel while talking 
(by affective words, gestures, intonation, facial expression, gaze, and posture). 

Emotions may be implied in communication in at least two ways: (i) they 
may be the very reason that triggers communication: we activate the goal of 
communicating just because we want to express our emotion; (ii) they may 
intervene during our communication, as a reaction to what our interlocutor 
is saying, or to some thought suddenly coming to our mind, either related to 
the ongoing dialogue or not. 

In both cases, the triggering of emotion does not necessarily imply that the 
Agent displays it. There are many reasons why we may refrain from express- 
ing our emotion, and the final (aware or non-aware) decision of displaying it 
may depend on a number of factors [32, 10]. Some of them concern the very 
nature of the emotion felt (emotional nature), others the interaction of several 
contextual (scenario) factors. 



3 MagiCster Architecture 

To illustrate the requirements that our ECA has to fulfill, we start with an 
example that will be used throughout the chapter: an advice-giving dialogue in 
the medical domain (Table 1), in which the Agent moves are denoted by Gi and 
the User moves by Uj. 



Table 1. An example of dialogue in the medical domain 

G0: Good morning Mr. Smith. 
U1: Good morning Doctor Greta. 
    Have you seen my tests? 
G1: Yes, and I’m sorry to tell you that you have been diagnosed 
    as suffering from angina pectoris, which appears to be mild. 
U2: What is angina? 
G2: Angina is a spasm of the chest resulting from overexertion when 
    the heart is diseased. 
U3: Is it possible to cure it? 
G3: Yes, a drug therapy does exist. To solve your problem, you 
    should take two drugs. The first one is Aspirin and the second 
    one is Atenolol. 



In this dialogue, the Agent (named Greta) plays the role of a doctor and the 
Interlocutor is a patient asking for information about his disease. As explained 
in the previous section, in order to show believable behavior, the Agent has 
to act consistently with her role, mental state, goals, personality, and social 
context; this is especially important in delicate conversational fields such as 
medical advice. In addition, the Agent has to decide whether an emotion is 
felt and, according to the interaction context, whether it has to be conveyed 
at the signal level. 

For instance, while conversing with the patient (see the example in Table 
1), Greta will coordinate her speech with various expressions: 

• In move G1, she manifests her empathy with the User. She does so not 
only verbally (“I’m sorry to tell you”) but also non-verbally, by displaying 
the expression of “sorry-for”. To play down the seriousness of the illness, 
Greta will emphasize both verbally and non-verbally the fact that it is still 
in a “mild” form. 

• In move G2, Greta indicates her chest while saying “a spasm of the chest” 
while, in move G3, she looks at the User while saying “your problem”. The 
two expressions are realized through a particular gaze direction that plays 
a deictic function to indicate a given point in space. 

Let us now see how the MagiCster architecture supports the generation 
of dialogues of this kind. 





Fig. 1. MagiCster system architecture 



The architecture of the MagiCster system is shown in Fig. 1. It is made up 
of two main components (a Mind and a Body), interfaced by a Plan Enricher. 
The Agent’s Mind includes a Content Planner, a Dialogue Manager, and an 
Affective Agent Modeling module. The latter is responsible for updating 
the Agent’s mental state, that is, her goals and beliefs. The Body is a 3D 
face/avatar, with a speech synthesizer [4] for animated spoken delivery. We 
will briefly describe each module before focusing on the Mind-Body 
interface. 

The Affective Agent Modeling module decides whether a particular 
affective state should be activated and with which intensity and whether the 
felt emotion should be displayed in a given context [12]. 

The Content Planner is responsible for the generation of the discourse 
plan appropriate to the context [9]. At this level, the emphasis is on the goals 
that the Agent has to achieve in that piece of conversation. No information 
about how to express them in terms of agent behavior is represented in the 
discourse plan. Following Moore and Paris [22], the plan is a tree identified by 
its name; its main components are nodes, each also identified by a name. 
Nodes include mandatory attributes describing the communicative goal, the 
discourse focus, and the rhetorical elements (the role in the Rhetorical 
Relation (RR) of the father-node and the rhetorical relation). The DPML DTD 
is as follows: 

<!ELEMENT d-plan (node+)> 
<!ATTLIST d-plan 
    name CDATA #REQUIRED 
> 
<!ELEMENT node (node*, info*)> 
<!ATTLIST node 
    name  CDATA #REQUIRED 
    goal  CDATA #REQUIRED 
    role  (root | nucleus | sat) #REQUIRED 
    RR    CDATA #REQUIRED 
    focus CDATA #REQUIRED 
> 

The discourse plans are represented as XML-based structures for the following 
reasons. First, this enables us to build a library of standard conversation plans 
in the medical domain that can be instantiated when needed and used in 
any application context (text, hypertext, voice, and so on). Second, XML provides 
a standard interface between the generator modules, which favors resource 
distribution and reuse. 
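
As a rough illustration of the DPML DTD given above, the following Python sketch parses a small discourse plan and checks the constraints that the DTD states (the required attributes and the three admissible values of role). The plan itself, including its node names, goals, RR values, and focus values, is a hypothetical example built around move G3 of Table 1 and is not taken from the MagiCster plan library; the check_plan helper is likewise an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical plan (illustrative names and values, not from the MagiCster library).
DPML = """
<d-plan name="ExplainTherapy">
  <node name="n1" goal="Explain(therapy)" role="root" RR="solutionhood" focus="therapy">
    <node name="n2" goal="Inform(therapy-exists)" role="nucleus" RR="none" focus="therapy"/>
    <node name="n3" goal="Suggest(take-drugs)" role="sat" RR="none" focus="drugs"/>
  </node>
</d-plan>
"""

REQUIRED = ("name", "goal", "role", "RR", "focus")  # all #REQUIRED in the DTD
ROLES = {"root", "nucleus", "sat"}                  # enumerated values of role


def check_plan(xml_text):
    """Return a list of violations of the DTD constraints (empty if none)."""
    plan = ET.fromstring(xml_text)
    problems = []
    if plan.tag != "d-plan":
        problems.append("root element is not d-plan")
    if "name" not in plan.attrib:
        problems.append("d-plan lacks name")
    if plan.find("node") is None:
        problems.append("d-plan contains no node")
    for node in plan.iter("node"):
        label = node.get("name", "?")
        for attr in REQUIRED:
            if attr not in node.attrib:
                problems.append(f"node {label} lacks {attr}")
        role = node.get("role")
        if role is not None and role not in ROLES:
            problems.append(f"node {label} has invalid role {role}")
    return problems


print(check_plan(DPML))  # an empty list means no violation was found
```

A validating XML parser fed with the DTD would perform the same checks; the hand-written version is used here only to keep the example dependent on the standard library alone.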

The Dialogue Manager (DM) is built on top of the TRINDI architecture [18], 
which provides an engine for computing dialogue moves and a space in which 
information relevant to the move selection and effect can be stored. Such 
information may be, for instance, the Agent’s mental state and the current 
plan. After a plan has been chosen from the library of plan recipes, the first 
Agent move is generated according to the first step of this plan. When the 
Agent is in dialogue with a User, the dialogue starts and the DM controls its 
flow by iterating the following steps, until the conversation ends [25]: 

1. the initiative is passed to the User, who can ask questions on any of the 
topics under discussion; 

2. the User move is translated into a symbolic communicative act (through 
a simplified interpretation process) and is passed to the DM; 

3. the DM decides “what to say next” by selecting the plan or subplan to 
execute in order to achieve the selected communicative goal. 
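
The three steps can be pictured as a simple loop. The Python sketch below is only a schematic rendering under strong simplifying assumptions: it does not use the TRINDI engine, the keyword-based interpret function stands in for the simplified interpretation process, and the act names and plan names in PLAN_LIBRARY are hypothetical.

```python
# Hypothetical plan library: symbolic user act -> name of the plan to execute.
PLAN_LIBRARY = {
    "ask-definition": "DescribeDisease",
    "ask-therapy": "ExplainTherapy",
}


def interpret(user_text):
    """Very simplified interpretation of a User move into a symbolic act."""
    text = user_text.lower()
    if text.startswith("what is"):
        return "ask-definition"
    if "cure" in text:
        return "ask-therapy"
    return "unknown"


def run_dialogue(user_moves, plan_library):
    """Iterate the three steps until no plan matches (end of conversation)."""
    executed = []
    for user_text in user_moves:        # step 1: the User takes the initiative
        act = interpret(user_text)      # step 2: move -> symbolic communicative act
        plan = plan_library.get(act)    # step 3: decide "what to say next"
        if plan is None:
            break
        executed.append(plan)
    return executed


moves = ["What is angina?", "Is it possible to cure it?", "Goodbye."]
print(run_dialogue(moves, PLAN_LIBRARY))
```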

The Plan Enricher translates the symbolic representation of a dialogue 
move into an Agent’s behavior specification at the meaning level. A dialogue 
move may be a “primitive” communicative act (for instance, a “greet”, a 
“thanks”, an “inform”, a “request”) or a more complex plan (for instance, 
“Describe an object with its properties”), annotated according to DPML. The 
choice of the appropriate communicative act (an implore rather than an order; 
showing or not showing an emotion) is based on the conversational context 
[11]. An algorithm translates this DPML-based tree structure into another 
XML-based language (APML), through a set of transformation rules that 
depend on the information attached to nodes in the discourse plan: rhetorical 
relation name and type, communicative goal, discourse focus, and so on. 
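
To make the idea of such a translation concrete, the following Python sketch turns the leaves of a small DPML tree into APML performatives. It is a toy under stated assumptions, not the Plan Enricher's actual algorithm: the rule table GOAL_TO_PERFORMATIVE, the plan, the surface texts, and the choice to copy the RR of the father node into the belief-relation attribute of each rheme are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical discourse plan and surface text for its leaf nodes.
DPML = """<d-plan name="ExplainTherapy">
  <node name="n1" goal="Explain(therapy)" role="root" RR="solutionhood" focus="therapy">
    <node name="n2" goal="Inform(therapy-exists)" role="nucleus" RR="none" focus="therapy"/>
    <node name="n3" goal="Suggest(take-drugs)" role="sat" RR="none" focus="drugs"/>
  </node>
</d-plan>"""
TEXT = {
    "n2": "Yes, a drug therapy does exist.",
    "n3": "To solve your problem, you should take two drugs.",
}

# Hypothetical transformation rule: goal type -> performative type.
GOAL_TO_PERFORMATIVE = {"Inform": "inform", "Suggest": "suggest"}


def enrich(dpml_text, text_by_node):
    """Translate each leaf node of a DPML plan into an APML performative."""
    plan = ET.fromstring(dpml_text)
    parent_of = {child: parent for parent in plan.iter() for child in parent}
    apml = ET.Element("APML")
    for node in plan.iter("node"):
        if node.find("node") is not None:
            continue  # inner nodes only organize the plan
        goal_type = node.get("goal").split("(", 1)[0]
        perf = ET.SubElement(apml, "performative",
                             {"type": GOAL_TO_PERFORMATIVE[goal_type]})
        rheme = ET.SubElement(perf, "rheme")
        rheme.text = text_by_node[node.get("name")]
        parent = parent_of[node]
        if parent.tag == "node":
            # Assumption: the father's rhetorical relation labels the rheme.
            rheme.set("belief-relation", parent.get("RR"))
    return apml


print(ET.tostring(enrich(DPML, TEXT), encoding="unicode"))
```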

The Face and Body Animation interprets the APML-tagged dialogue 
move and decides how to convey every meaning (by which combination of 
signals). As mentioned previously, the Body we use at present is a combination 
of a 3D face model compliant with the MPEG-4 standard [24] and a speech 
synthesizer [4]. 



4 A Markup Language for Behavior Specification: APML 

Believable conversational agents of the kind described above have motivated 
a number of markup languages used to provide meta information such as 
control and intent. These languages differ mainly in the level of abstraction 
of the representation and specification they provide. Most existing languages 
allow agent specification at the signal level, depending most of the time on 
the type of “body” supporting the behavior expression. 

For the reasons given in Sect. 2 of this chapter, we have developed a 
set of XML-based languages that include high-level primitives for specifying 
behavior acts, similar to those performed by humans, in order to express agent 
behavior at different levels of abstraction and to control easily the behavior of 
ECAs independently of the body. In this section we will describe APML (the 
Affective Presentation Markup Language), whose purpose is to specify the 
agent behavior at the meaning level. In particular, as its name suggests, the 
emphasis in APML is on the affective aspect of the communication between 
the Agent and the User. Before looking at APML tags in detail, let us briefly 
review other relevant markup languages that aim at describing and specifying 
human-like behaviors. 

4.1 Overview of Existing Markup Languages for Expressing 
Human-Like Behavior 

An effort toward building a standard markup language is represented by the 
Human Markup Language [15]. This language allows one to specify human 
communicative behaviors at a very high level. The aim of HML is to “develop 
Internet tools and repository systems which will enhance the fidelity of human 
communications” [15]. Its specification modules include tags allowing the 
representation of physical, cultural, social, kinetic, psychological, and intentional 
features used by humans in communicating information. 

The information encoded by HML is at a very abstract level: using it for 
controlling specific agent bodies may be difficult and may require developing 
complex interpreters to translate a very abstract specification into low-level 
body actions. For this reason, researchers tend to develop their own languages, 
more suited to the type of embodied agent they wish to control. 

Another example is VHML [37], which gathers several languages, each of 
them acting on a different modality: some tags are related to facial expressions, 
to gesture, and to emotion (EML), but also to dialogue management, speech 
synthesis, and so on. The language offers a large variety of tags: for example, 
tags representing signals (right raised eyebrow) or tags representing emotion 
(“happiness”). But the language does not implement the Mind-Body separation 
that we advocate. 

Some of the earliest work on developing a language for specifying life- 
like character behavior is represented by MPML (Multimodal Presentation 
Markup Language). MPML has been developed with the aim of enabling 
authors of web pages to add agents for improving human-computer interaction 
[16]. Its design has been driven by the choice of Microsoft Agent as a body. 
For instance, the tag for specifying a predefined animation sequence (<act>) 
takes, as a possible value, one of the MS-agent’s animations. 

More recently, BEAT (Behavior Expression Animation Toolkit), another 
XML language designed for generating embodied agent’s animation from tex- 
tual input [6], has been used for tagging both the Agent’s input and its output. 
The input is an utterance that is parsed into a tree structure; this tree is ma- 
nipulated to include information about non-verbal signals and then specified 
again in XML. The toolkit is then able to generate appropriate and synchro- 
nized non-verbal behaviors and synthesized speech specified by the output 
language containing tags describing the type of animation to be performed 
and its duration. 

At this level of specification, Avatar Markup Language (AML) [1] repre- 
sents a new high-level language to describe avatar animation. It is based on 
XML and encapsulates the Text to Speech, Facial Animation, and Body Ani- 
mation in a unified manner with appropriate synchronization. Other examples 
of such languages are RRL [28], CML [1], and MURML [17]. 

As we will see in more detail in the next section, APML differs from these 
languages in having been designed to represent the meaning level in which 
communicative goals are translated into communicative functions. 

4.2 Defining APML Tags 

Poggi et al. [31] define a communicative function as a (meaning, signal) pair, 
where the meaning item corresponds to the communicative value of the signal 
item. For instance, a smile can be the signal of the emotion “joy” or of a 
backchannel. This distinction between the meaning and the signal, that is the 
way in which the meaning can be communicated, has driven the design of 
APML. Due to the architectural choice of Mind-Body separation, tags should 
not specify the signal to be conveyed but only the meaning associated with a 
given communicative act. 
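
The (meaning, signal) view can be sketched as a small lookup structure. The Python fragment below is illustrative only: the CommunicativeFunction type, the two helper functions, and the contents of LIBRARY are assumptions of this sketch, with the entries chosen from the smile and nod examples mentioned in this chapter. It shows why a tag that names a meaning leaves the choice of signal open to the body.

```python
from typing import NamedTuple


class CommunicativeFunction(NamedTuple):
    meaning: str  # what is communicated (named in APML)
    signal: str   # how one particular body may express it


# Hypothetical library for one body.
LIBRARY = [
    CommunicativeFunction("joy", "smile"),
    CommunicativeFunction("backchannel", "smile"),
    CommunicativeFunction("backchannel", "nod"),
]


def signals_for(meaning, library):
    """Signals this body may use to convey a given meaning."""
    return [cf.signal for cf in library if cf.meaning == meaning]


def meanings_of(signal, library):
    """Meanings that one and the same signal may convey."""
    return [cf.meaning for cf in library if cf.signal == signal]


print(signals_for("backchannel", LIBRARY))
print(meanings_of("smile", LIBRARY))
```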

The APML (Affective Presentation Markup Language) DTD is: 

<!ELEMENT APML (turnallocation*, performative*, turnallocation*)> 
<!ELEMENT turnallocation (performative*)> 
<!ATTLIST turnallocation 
    type (take | give) #REQUIRED 
> 
<!ELEMENT performative (theme | rheme)+> 
<!ATTLIST performative 
    type (greet | inform | paraphrase | suggest | ask | ...) #REQUIRED 
    affect (sorry-for | relief | joy | happy-for | ...) #IMPLIED 
> 
<!ELEMENT theme (#PCDATA | emphasis | boundary)*> 
<!ATTLIST theme 
    affect (sorry-for | relief | joy | happy-for | ...) #IMPLIED 
    belief-relation (gen-spec | cause-effect | solutionhood | 
                     suggestion | justification | ...) #IMPLIED 
> 
<!ELEMENT rheme (#PCDATA | emphasis | boundary)*> 
<!ATTLIST rheme 
    affect (sorry-for | relief) #IMPLIED 
    belief-relation (gen-spec | cause-effect | solutionhood | 
                     suggestion | modifier | justification) #IMPLIED 
> 
<!ELEMENT emphasis (#PCDATA)> 
<!ATTLIST emphasis 
    level (strong | medium | weak) "medium" 
    x-pitchaccent (Hstar | Lstar | LplusHstar | LstarplusH | 
                   HstarplusL | HplusLstar) "Hstar" 
    deictic CDATA #IMPLIED 
    adjectival (small | tiny) #IMPLIED 
> 
<!ELEMENT boundary EMPTY> 
<!ATTLIST boundary 
    type (L | H | LL | HH | LH | HL) "LL" 
> 

We are showing here the DTD instead of the XML Schema for space 
reasons, since Schemas have a less compact representation than DTDs. 
Every dialogue turn specified with this language starts with the root tag 
<APML>. To indicate that the agent is taking or giving the initiative, the 
turnallocation tag can be used: its type attribute can take the value “take” 
or “give”. For instance, in the following APML sentence the Agent is starting 
the conversation by taking the turn and greeting: 

<APML> 
<turnallocation type="take"> <performative type="greet"> 
<rheme>Good <emphasis x-pitchaccent="Hstar">morning</emphasis> 
Mr Smith. <boundary type="LL"/></rheme></performative> 
</turnallocation> 
</APML> 

The type attribute of the performative tag specifies the type of performative. 
It may take one of the values specified in the DTD (e.g. suggest, inform, 
and so on). For instance, in the previous example it is a “greet”. 

The tags <theme> and <rheme> refer to the information structure of the 
phrase. The theme corresponds to the topic or part of the utterance that links 
it to the preceding discourse. Often themes are completely given or old infor- 
mation, and can be uttered without word emphasis, or even elided completely. 
The rheme is the part of the utterance that moves the discourse forward by 
providing needed information relevant to the theme. By its nature the rheme 
must contain new information, usually carries some emphasized words, and 
cannot be elided; that is, it is the focus of the discourse. Performative tags 
may embed theme and rheme structures. The affect attribute expresses the 
emotion that has been triggered in the Agent’s Mind module; it may be 
attached to an entire performative, to a theme, or to a rheme. The 
belief-relation attribute takes as a value the name of the RR present in the DPML 
specification. We will give an example of APML generation in Sect. 6. 
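
The tags described so far can be read back with any standard XML parser. The following Python sketch parses a well-formed version of the greeting turn and lists, for each theme or rheme, the performative type, the affect (if any), the text, and the emphasized words. The summarize function and the shape of its result are assumptions of this sketch and are not part of APML or of the MagiCster software.

```python
import xml.etree.ElementTree as ET

# The greeting turn, written as well-formed XML with the root element closed.
APML = """<APML>
<turnallocation type="take"><performative type="greet">
<rheme>Good <emphasis x-pitchaccent="Hstar">morning</emphasis>
Mr Smith. <boundary type="LL"/></rheme></performative>
</turnallocation>
</APML>"""


def summarize(apml_text):
    """List the meaning-level information attached to each theme or rheme."""
    root = ET.fromstring(apml_text)
    parts = []
    for perf in root.iter("performative"):
        for part in perf:  # the children of a performative are themes and rhemes
            text = " ".join("".join(part.itertext()).split())
            parts.append({
                "performative": perf.get("type"),
                "part": part.tag,
                # affect may sit on the theme/rheme or on the whole performative
                "affect": part.get("affect") or perf.get("affect"),
                "text": text,
                "emphasized": [e.text for e in part.iter("emphasis")],
            })
    return parts


for entry in summarize(APML):
    print(entry)
```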

Communicative functions may be synchronized at different levels [26]. The 
performative type usually spans the whole communicative act. Other communicative 
functions modulate a single word or semantic element of the utterance 
and usually last only the time of that word or semantic element. The affec- 
tive and the belief-relation functions are synchronized with the information 
structure of a discourse. They may be represented as attributes of the theme 
or rheme tags; the other communicative functions (adjectival, deictic) have 
a more local character as they act on the word(s) they refer to and would 
correspond to separate tags. 

Intonation is specified using <emphasis> and <boundary> tags. These
tags follow the ToBI notation [27]. The <emphasis> tag applies to lexical
words, and identifies these words as new or contrastive information,
contributing to distinguishing the theme or rheme that they occur in from other themes
and rhemes that may be actually or potentially in play in the discourse.
Emphasis is realized as pitch accent in the intonation contour and by
various facial and manual or head gestures. Pitch accent type is determined by
theme-rheme status. For domains like the present one, theme accents (where
needed) are realized as L+H* accents, while rheme accents are realized as H*.
The values of the boundary tones depend on the relation between successive
intonational phrases, the syntactic characteristic of the sentence (e.g. as an
interrogative), or on subtle aspects of hearer or speaker orientation of the
information unit [35, 33]. In practice, most themes in discourses like those we
treat here bear an LH% or “continuation rise” boundary, and most rhemes
bear low LL% boundaries.
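
The default intonation choices described above can be summarized as a small lookup. The following Python sketch is illustrative only: the function name intonation_tags and the table layout are assumptions and not part of APML, while the attribute values (LplusHstar, Hstar, LH, LL) are those used in the APML examples of this chapter.

```python
# Default accent and boundary per information-structure unit, as stated in
# the text: themes take L+H* accents and an LH% boundary, rhemes take H*
# accents and an LL% boundary. The table layout is a hypothetical choice.
DEFAULTS = {
    "theme": {"accent": "LplusHstar", "boundary": "LH"},
    "rheme": {"accent": "Hstar", "boundary": "LL"},
}


def intonation_tags(unit, words, emphasized):
    """Wrap `words` in a <theme> or <rheme> element with default intonation.

    `emphasized` is the set of words that carry a pitch accent.
    """
    style = DEFAULTS[unit]
    parts = []
    for word in words:
        if word in emphasized:
            parts.append('<emphasis x-pitchaccent="%s">%s</emphasis>'
                         % (style["accent"], word))
        else:
            parts.append(word)
    body = " ".join(parts)
    return '<%s>%s <boundary type="%s"/></%s>' % (
        unit, body, style["boundary"], unit)
```

Calling intonation_tags("rheme", ["Good", "morning"], {"morning"}) yields a rheme with an H* accent on "morning" and a closing LL boundary, in the spirit of the greeting example in Sect. 4.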



5 Facial Description Language 

Humans can display a large spectrum of facial expressions, including
expressions that differ only subtly yet remain perceptibly distinct. We have developed
a language to describe facial expressions as (meaning, signal) pairs. These
expressions are stored in a library. When the planner enriches the discourse move
with a communicative meaning, the program looks up the corresponding signals
in the library and instantiates the APML tag with the corresponding
signal values. Defining facial expressions using keywords such as “happiness,
raised eyebrow, surprise” does not capture these slight variations. In our
language, an expression may be defined at a high level (a facial expression is a
combination of other, predefined facial expressions) or at a low level (a
facial expression is a combination of facial parameters). The low-level facial
parameters correspond to the MPEG-4 Facial Animation Parameters (FAPs)
[24]. The language allows one to create a large variety of facial expressions for
any communicative function, as well as the subtleties that distinguish facial
expressions. Paradiso and L’Abbate [23] have established an algebra to create
facial expressions, with operators that combine and
manipulate facial expressions. Our language is created with the sole purpose
of creating the facial expressions that are associated with a given
communicative function. We have worked out a method to combine facial expressions due
to distinct co-occurring communicative acts using a Bayesian network [26].

We consider two items: “facial basis” (FB) and “facial display” (FD). An 
FB is a basic facial movement such as right raised eyebrow, upper lip raise, 
jaw opening, left upper eyelid lowered, and so on. FBs also include eye and 
head movements such as nodding, shaking, turning the head, and the eyes. An 
FB may be represented as a set of MPEG-4-compliant FAPs or, recursively, 
as a combination of other FBs (see Fig. 2). 

Every facial display (FD) is made up of one or more FBs (see Fig. 3). For 
example, we can define the “surprise” facial display as: 

surprise = raised-eyebrow + raised-lid + open-mouth; 

We can also define an FD as a combination of two or more (already) defined 
facial displays using the “+” and “*” operators. For instance, the “worried” 
facial display is a non-uniform combination of “surprise” (slightly decreased) 
and “sadness” facial displays (see Fig. 4): 
Fig. 2. The combination of “left raised eyebrow” (left) and “right raised eyebrow”
(centre) produces a “raised eyebrow” movement (right)




Fig. 3. The “raised eyebrow” expression (left) and its more intense equivalent (right) 



worried = (surprise * 0.7) + sadness;

with: surprise = raised-eyebrow + raised-lid + open-mouth;
raised-eyebrow = left-raise-eyebrow + right-raise-eyebrow;
left-raise-eyebrow = {fap31 = 50, fap33 = 100, fap35 = 50};
right-raise-eyebrow = {fap32 = 50, fap34 = 100, fap36 = 50};
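
To make the “+” and “*” operators concrete, here is a minimal Python sketch that treats a facial display as a mapping from FAP names to values. It rests on stated assumptions: “+” is taken to sum the values of shared FAPs and “*” to scale every FAP uniformly, which the text does not specify, and the FAP values for raised-lid, open-mouth, and sadness are hypothetical placeholders, since only the eyebrow values are given above.

```python
def combine(*displays):
    """'+' operator: merge displays, summing values of shared FAPs (assumed)."""
    result = {}
    for display in displays:
        for fap, value in display.items():
            result[fap] = result.get(fap, 0) + value
    return result


def scale(display, factor):
    """'*' operator: scale the intensity of every FAP in a display (assumed)."""
    return {fap: value * factor for fap, value in display.items()}


# Facial bases with the FAP values given in the text.
left_raise_eyebrow = {"fap31": 50, "fap33": 100, "fap35": 50}
right_raise_eyebrow = {"fap32": 50, "fap34": 100, "fap36": 50}
raised_eyebrow = combine(left_raise_eyebrow, right_raise_eyebrow)

# Hypothetical placeholder values: the text does not list the FAPs of
# raised-lid, open-mouth, or sadness.
raised_lid = {"fap19": -30, "fap20": -30}
open_mouth = {"fap3": 200}
sadness = {"fap31": 40, "fap32": 40}

surprise = combine(raised_eyebrow, raised_lid, open_mouth)
worried = combine(scale(surprise, 0.7), sadness)
```

With these placeholder values, the “worried” display keeps every FAP of “surprise” at 70% intensity and adds the contribution of “sadness” on the FAPs the two share.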



6 An Example 

In this section, we work through an example taken from a medical domain
application. The User may converse with the Agent and ask her about his physical
condition. The Dialogue Manager (DM) elaborates a discourse plan by
consulting the domain model. This plan is then enriched by the plan enricher,
which translates a DPML-based tree structure into an APML-based structure.

Let us suppose the User is asking about the severity of his disease. The 
DM selects the following dialogue move whose DPML recipe is: 

<node name="n1" goal="Explain(Has(U, disease))"
role="root" focus="disease" RR="ElabObjAttr">

<node name="n2" goal="Inform(Has(U, disease))"
role="nucleus" focus="Has(U, disease)" RR="null"/>

<node name="n3" goal="Inform(Severity(disease))"
role="sat" focus="Severity(disease)" RR="null"/>

</node>

Fig. 4. The combination of “surprise” (left) and “sadness” (centre) produces a
“worried” facial display (right)

Then, given a DPML tree, the transformation algorithm (called MIDAS) 
reads it recursively down to the leaves. 

Each dialogue move starts with the root tag <APML>. As indicated by
the DTD specifications, this tag may be followed by a <turn-allocation> or
by a <performative> tag.

A turn-allocation function does not have to be performed for every dialogue
move; it is needed only to indicate an exchange of the speaking turn. Taking into
account the dialogue history and the focus shift, it is therefore generated for the first
and the last dialogue move, and whenever the User changes the focus of discourse.

Every leaf node is transformed into a <performative> element. The
corresponding verbal sentence may contain both <rheme> and <theme> or only
one of these.

MIDAS recursively reads the DPML structure of the dialogue move to
generate the performative tags. Since DPML is driven by RRs, we use RR-driven
transformation rules to instantiate the appropriate <performative>.
The value of the RR attribute attached to the node
activates the proper recursive schema.

According to Mann et al. [21], RRs can be classified into two families: 
subject-matter and presentational relations. 

Subject-matter relations are defined as “those whose intended effect is that 
the reader recognizes the relation in question” (elaboration, solutionhood, 
summary, circumstance, contrast, etc.). 
Presentational relations are “those whose intended effect is to increase
some inclination in the reader, such as the desire to act or the degree of positive 
regard for, belief in, or acceptance of the nucleus” (motivation, background, 
enablement, evidence, justification). 

We follow this classification to derive a general rule: for nucleus-satellite
subject-matter relations, the belief-relation emphasis is placed on the RR marker in
the satellite. That is, in order to emphasize the relation
holding between the subject and the matter, we put the belief-relation
attribute in the theme of the performative representing the satellite. In the
case of multinuclear subject-matter RRs (e.g. Ordinal Sequence, Contrast),
the RR marker is set in the theme of the performative representing the
nucleus. If the RR is a presentational one, then to increase the reader’s belief
in what is being stated in the nucleus, we put the belief-relation attribute in
the rheme of the performative representing the nucleus. According to these
rules we defined the MIDAS transformation schemas as follows:

Current_node = n1
IF Current_node.role = root ==> write("<APML>")
IF checkTa ==> write("<turn-allocation type="take">")
IF Current_node.RR = SM_NS ==>
   { Midas(Current_node.nucleus, NULL, NULL)
     Midas(Current_node.satellite, 'theme', Current_node.RR) }
IF Current_node.RR = SM_NN ==>
   Forall Current_node.nucleus:
     Midas(Current_node.nucleus, 'theme', Current_node.RR)
IF Current_node.RR = PRES ==>
   { Midas(Current_node.nucleus, 'rheme', Current_node.RR)
     Midas(Current_node.satellite, NULL, NULL) }
IF Current_node.RR = NULL ==>
   Performative.generate(Current_node, brEmph, RR)
...

Let us again consider the DPML structure in the above example. The root
node is n1 and its RR is ElabObjAttr; it includes a nucleus, which
mentions an object, and a satellite that describes a property of that object. It
corresponds to a subject-matter relation holding between a nucleus and a
satellite (SM_NS). Thus, a recursive call is first made on the nucleus with
the belief-relation emphasis (brEmph) parameter and the RR value both set to
“NULL”. Then, a recursive call is made on the satellite with the brEmph
parameter set to theme and the RR value set to the RR of the current node.
This indicates that the belief-relation attribute has to be set in the theme of
the second performative, which is the satellite of the RR.
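
The recursion just described can be sketched in a few lines of Python. This is not the authors' implementation of MIDAS: the function midas, the table RR_FAMILY, and the dictionary it returns are hypothetical, and the RR names other than ElabObjAttr are placeholders. The sketch only shows how the brEmph and RR parameters are propagated from a node to its nucleus and satellite.

```python
import xml.etree.ElementTree as ET

DPML = """
<node name="n1" goal="Explain(Has(U, disease))"
      role="root" focus="disease" RR="ElabObjAttr">
  <node name="n2" goal="Inform(Has(U, disease))"
        role="nucleus" focus="Has(U, disease)" RR="null"/>
  <node name="n3" goal="Inform(Severity(disease))"
        role="sat" focus="Severity(disease)" RR="null"/>
</node>
"""

# Hypothetical RR classification table; only ElabObjAttr is named in the text.
RR_FAMILY = {
    "ElabObjAttr": "SM_NS",    # subject-matter, nucleus-satellite
    "Contrast": "SM_NN",       # subject-matter, multinuclear
    "Motivation": "PRES",      # presentational
}


def midas(node, br_emph=None, rr=None, out=None):
    """Walk a DPML tree and collect one performative stub per leaf."""
    if out is None:
        out = []
    family = RR_FAMILY.get(node.get("RR"))
    children = list(node)
    if family is None or not children:
        # Leaf (RR="null"): surface realization would happen here.
        speech_act = node.get("goal").split("(", 1)[0].lower()
        out.append({"node": node.get("name"), "type": speech_act,
                    "belief_relation_in": br_emph, "rr": rr})
        return out
    nuclei = [c for c in children if c.get("role") == "nucleus"]
    satellites = [c for c in children if c.get("role") == "sat"]
    if family == "SM_NS":
        for n in nuclei:
            midas(n, None, None, out)
        for s in satellites:
            midas(s, "theme", node.get("RR"), out)
    elif family == "SM_NN":
        for n in nuclei:
            midas(n, "theme", node.get("RR"), out)
    elif family == "PRES":
        for n in nuclei:
            midas(n, "rheme", node.get("RR"), out)
        for s in satellites:
            midas(s, None, None, out)
    return out


performatives = midas(ET.fromstring(DPML.strip()))
```

For the example tree, the stub for n2 carries no belief-relation emphasis, while the stub for n3 records that the belief-relation attribute ElabObjAttr belongs in its theme.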

When the algorithm reaches a leaf node, the Performative.generate(node,
brEmph, RR) function is called and the recursion ends. This function is
responsible for the surface realization in which the <performative> element is
generated. Its type attribute is set to the type of speech act present in the
DPML node goal. If the Affective Agent Modeling component establishes that
an emotion has been triggered at the current node 6 and that this emotion has
to be displayed [11], its affect attribute is set to that emotion name.

Besides generating the <performative> tag with its attributes, this
function produces the verbal part of the speech act and includes, if needed, two
more tags: <theme> and <rheme>. According to the values of
its parameters, the belief-relation attribute is set appropriately. For instance,
in the above-mentioned example, the type of the performative tag
representing the nucleus n2 is set to “inform”. In this case, since the Agent feels a
“sorry-for” emotion, the affect attribute of the performative gets this value.
Since the RR of the father node is a subject-matter relation, according to the
transformation rules the nucleus performative does not get any belief-relation
emphasis and the following annotated sentence is generated:

<performative type="inform" affect="sorry-for">
<theme> I’m sorry to
<emphasis x-pitchaccent="LplusHstar">tell</emphasis> you
<boundary type="LH"/> </theme> <rheme> that you have been
<emphasis x-pitchaccent="Hstar">diagnosed</emphasis> as
<emphasis x-pitchaccent="Hstar">suffering</emphasis> from
what we call <emphasis x-pitchaccent="Hstar">angina</emphasis>
<emphasis x-pitchaccent="Hstar">pectoris,</emphasis></rheme>
</performative>

Both theme and rheme may contain emphases of various types. The
settings of the intonational tags are derived from the DPML structure, domain
information, and discourse history. We may have the following cases:

• Emphasis on the word indicating the performative. For instance,
in a theme: <emphasis x-pitchaccent="LplusHstar">tell</emphasis>.

• Emphasis on the attentional elements of discourse [14]. For
instance, from Has(U, angina) (which constitutes the rheme of this
utterance) MIDAS puts emphases on the words suffering and angina
pectoris (in the lexicon a single word), since these contribute to
distinguishing this rheme from other conditions and relations to them.

• Emphasis on adjectival communicative function. When the
argument of the communicative goal is a quantitative attribute of the discourse
focus, the emphasis tag with the attribute adjectival is attached to the
argument. For instance, in the following APML sentence, “severity” is
a quantitative property of angina which is also the discourse focus, and
thus is marked with an emphasis tag: <emphasis x-pitchaccent="Hstar"
adjectival="small">mild</emphasis>.



• Emphasis on deictic communicative function. When the argument
of the communicative goal is described in the domain knowledge base
as “referenceable through its coordinates” and this argument is also
the discourse focus, then we have the following tag type: <emphasis
x-pitchaccent="Hstar" deictic="chest">chest</emphasis>.

6 We view a discourse element as an event. Such an event may or may not trigger
an emotion.

Once the generation of the first leaf of the tree (the first performative) 
ends, the recursive call on the subtree starting from the satellite node starts. 
According to the rules just explained, the theme receives the belief-relation 
emphasis and the following tagged phrase is generated: 

<performative type="inform">

<theme belief-relation="ElabObjAttr">which </theme>

<rheme> appears to be <emphasis x-pitchaccent="Hstar"
adjectival="small">mild</emphasis>

</rheme> </performative>

The APML tags are instantiated with their facial signals by looking them
up in the library associated with the Agent Body. In this example “certain”
corresponds to a frown and the tag “sorry-for” to the signals: inner eyebrow
up, head aside, mouth corner down. There is a conflict in the eyebrow region.
Our system resolves the conflicts that may occur when more than one
communicative function spans the same text [26]. The conflict resolution uses a
Bayesian network that takes one or more communicative functions as input
and outputs the final combined expression. The final expression may
contain the meanings of all the communicative functions, creating an expression
of complex meaning. Figure 5 illustrates how the frown of “certainty” gets
integrated within the facial expression of “sorry-for”.



7 Conclusions 

In this chapter, we have described the architecture of the behavior generator 
of a believable conversational agent. In particular, we focused our discussion 
on the importance of Mind-Body separation and therefore on the need for 
an interface between the two modules. Such an interface should be able to 
represent the communicative functions that can be potentially realized by 
different bodies with different expressive capabilities. We have defined two 
XML-like markup languages to represent the Mind’s output, that is the dis- 
course plans (called DPML) and the Body’s input (named APML). We have 
also described how a plan enricher transforms DPML trees into APML trees. 
Finally we have presented our language for defining facial expressions. Online 
materials (videos and examples of APML specification) may be found at [19]. 
Fig. 5. Expression of “sorry-for” (top left), “certain” (top right), and combination
of both expressions with conflict resolution (bottom) 
Summary. In this chapter we propose a scripting language, called STEP, for
embodied agents, in particular for their communicative acts like gestures and postures.
Based on the formal semantics of dynamic logic, STEP has a solid semantic foun- 
dation, in spite of a rich number of variants of the compositional operators and 
interaction facilities on worlds. STEP has been implemented in the distributed logic 
programming language DLP, a tool for the implementation of 3D web agents. In 
this chapter, we discuss principles of scripting language design for embodied agents 
and several aspects of the application of STEP. 



1 Introduction 

Embodied agents are autonomous agents which have bodies by which the 
agents can perceive their world directly through sensors and act on the world 
directly through effectors. Embodied agents whose experienced worlds are 
located in real environments are usually called cognitive robots. Web agents 
are embodied agents whose experienced worlds are the web; typically they 
act and collaborate in networked virtual environments. In addition, 3D web 
agents are embodied agents whose 3D avatars can interact with each other or 
with users via web browsers [11]. 

Embodied agents usually interact with users or each other via multi-modal 
communicative acts, which can be verbal or non-verbal. Gestures, postures, 
and facial expressions are typical non-verbal communicative acts which con- 
tribute to the representation of avatars as life-like characters. In general, spec- 
ifying communicative acts for embodied agents is not easy; they often require 
a lot of geometric data and detailed movement equations for the specification 
of gestures. 

In this chapter we propose the scripting language STEP (Scripting
Technology for Embodied Persona), in particular for communicative acts of
embodied agents. At present, we focus on aspects of the specification and modeling
of gestures and postures for 3D web agents. However, STEP can be extended
for other communicative acts like facial expressions or speech, and other types 
of embodied agents, like cognitive robots. Scripting languages are to a certain 
extent simplified languages which ease the task of programming and devel- 
opment. One of the main advantages of using scripting languages is that the 
specification of communicative acts can be separated from the programs which 
specify the agent architecture and mental state reasoning. Thus, changing the 
specification of communicative acts does not require reprogramming an agent. 

The avatars of our 3D web agents are built in the Virtual Reality Modeling 
Language (VRML) or X3D, the next generation of VRML. These avatars have 
a humanoid appearance. The humanoid animation working group 1 proposes 
a specification, called H-anim specification, for the creation of libraries of 
reusable humanoids in web-based applications as well as authoring tools that 
make it easy to create humanoids and animate them in various ways. H- 
anim specifies a standard way of representing humanoids in VRML. We have 
implemented the proposed scripting language for H-anim-based humanoids in 
the distributed logic programming language DLP [5] 2.

DLP is a tool for the implementation of 3D intelligent agents [12] 3. In this
chapter, we discuss how STEP can be used for embodied agents. STEP in- 
troduces a Prolog-like syntax, which makes it compatible with most standard 
logic programming languages, whereas the formal semantics of STEP is based 
on dynamic logic [9]. Thus, STEP has a solid semantic foundation, in spite 
of a rich number of variants of the compositional operators and interaction 
facilities on worlds. 



2 Principles 

We designed the scripting language primarily for the specification of commu- 
nicative acts for embodied agents; we have separated the external-oriented 
communicative acts from internal changes of the mental states of embodied 
agents because the former involves only geometric changes of the body ob- 
jects and the natural transition of the actions, whereas the latter involves 
more complicated computation and reasoning. Of course, a question is: why 
not use the same scripting language for both external gestures and internal 
agent specification? Our answer is: the scripting language is designed to be a 
simplified, user-friendly specification language for embodied agents, whereas 
the formalization of intelligent agents requires a powerful specification and 
programming language. It is not our intention to design a scripting language 
with fully functional computation facilities, as found in programming
languages like Java, Prolog, or DLP. A scripting language should be
interoperable with a fully powered agent implementation language, but offer a rather
easy way for authoring. Although communicative acts are the result of the
internal reasoning of embodied agents, they do not need the expressiveness
of a general programming language. However, we do require that a scripting
language should be able to interact with the mental states of embodied agents
in some ways, which will be discussed in more detail later.

1 http://h-anim.org

2 http://www.cs.vu.nl/~eliens/projects/logic/index.html

3 http://wasp.cs.vu.nl/wasp

We consider the following design principles for a scripting language. 

Principle 1: Convenience 

As mentioned, the specification of communicative acts, like gestures and facial 
expressions, usually involves a lot of geometric data, like ROUTE statements 
in VRML or movement equations in computer graphics. A scripting language 
should hide these geometric difficulties, so that even the authors who have 
limited knowledge of computer graphics can use it in a natural way. For ex- 
ample, suppose that authors want to specify that an agent turns its left arm 
forward slowly. This can be specified as: 

turn(Agent, left_arm, front, slow)

It should not be necessary to specify it as follows, which requires knowledge 
of a coordinate system, rotation axis, etc.: 

turn(Agent, left_arm, rotation(1, 0, 0, 1.57), 3)

One of the implications of this principle is that embodied agents should 
be aware of their context; they should be able to understand what certain 
indications mean, like the directions “left” and “right”, or the body parts 
“left arm”, etc. 

Principle 2: Compositional Semantics 

Composite actions should be specifiable in terms of existing components. For example,
an action in which an agent turns both arms forward slowly can be defined in
terms of two primitive actions, turn-left-arm and turn-right-arm:

par([turn(Agent, left_arm, front, slow),
turn(Agent, right_arm, front, slow)])

Typical composite operators for actions are the sequence action seq, the
parallel action par, and the repeat action repeat, which are used in dynamic logic [9].

Principle 3: Redefinability 

Scripting actions (e.g. composite actions) can be defined in terms of other 
actions explicitly. The scripting language incorporates a rule-based specifica- 
tion system, where scripting actions can be defined by their own set of rules. 
These defined actions can be reused for other scripting purposes. For exam- 
ple, if we have defined two scripting actions run and kick, then a new action 
run-then-kick can be defined in terms of run and kick:







Zhisheng Huang, Anton Eliens, and Cees Visser 



run_then_kick(Agent) =

seq([run(Agent), kick(Agent)]).

which can be specified in a Prolog-like syntax: 

script(run_then_kick(Agent), Action) :-

Action = seq([run(Agent), kick(Agent)]).

Principle 4: Parameterization 

Scripting actions can be adapted to yield other actions; actions can be specified
in terms of how they cause changes over time to each individual degree of
freedom, as proposed by Perlin and Goldberg in [16]. For example, suppose
that we define a scripting action run: we know that running can be done at
different paces. It can be done “fast” or “slow”. It should not be necessary to
define run actions for particular paces. We can define the action “run” with
respect to a degree of freedom “tempo”. Changing the tempo for a generic run
action should be enough to achieve a run action at different paces. Another
method of parameterization is to introduce variables or parameters in the
names of scripting actions, which allows for a similar action with different
values. In particular, agent names and their relevant parameters are specified
as variables in script libraries, so that the same scripting actions can be
reused for different embodied agents in different situations by different
authors. This significantly improves the reusability of scripting actions and
thereby productivity. This is one of the reasons why we introduce a
Prolog-like syntax in STEP.

Principle 5: Interaction 

Scripting actions should be able to interact with the world, including objects 
and other agents. More exactly, scripting actions can perceive the world, even 
embodied agents’ states, in order to decide whether or not the current action 
should be continued, or replaced by other actions. This kind of interaction 
can be achieved by the introduction of high-level interaction operators as 
defined in dynamic logic. The operator “test” and the operator “conditional” 
are examples of operators that facilitate the interaction between actions and 
states. 

These five principles are a guideline for the design of the scripting language 
STEP. The principle of convenience implies that STEP uses some natural- 
language-like terms for references. The principle of compositional semantics 
states that STEP has a set of built-in action operators. The principle of re- 
definability suggests that STEP should incorporate a rule-based specification 
system. The principle of parameterization justifies that STEP introduces a 
Prolog-like syntax. The principle of interaction requires that STEP is based 
on a more powerful meta-language. 
Fig. 1. Direction reference for humanoid



3 The Scripting Language STEP 

In this section, we discuss the general aspects of the scripting language STEP. 
We propose the reference systems for STEP first. 

3.1 Reference Systems 

The reference system of STEP consists of three components: direction refer- 
ence, body reference, and time reference. 

Direction Reference 

The direction reference system in STEP is based on the H-anim specification: 
the initial humanoid position should be modeled in a standing position, facing 
in the +Z direction with +Y up and +X to the humanoid’s left. The origin 
(0, 0, 0) is located at ground level, between the humanoid’s feet. The arms 
should be straight and parallel to the sides of the body with the palms of the 
hands facing inwards toward the thighs. 

Based on the standard pose of the humanoid, we can define the direction 
reference system as sketched in Fig. 1. The direction reference system is based 
on these three dimensions: front vs. back, which corresponds to the Z-axis; 




Fig. 2. Combination of the directions for left arm 



up vs. down, which corresponds to the Y-axis; and left vs. right, which cor- 
responds to the X-axis. Based on these three dimensions, we can introduce 
a more natural-language-like direction reference scheme: for example, turning 
left arm to “front-up” is to turn the left arm such that the front-end of the 
arm will point to the up front direction. Figure 2 shows several combinations 
of directions based on these three dimensions for the left arm. The direction 
references for other body parts are similar. 

These combinations are designed for convenience, as discussed in
Sect. 2. However, they are in general not sufficient for more complex
applications. To solve this kind of problem, we introduce interpolations with respect
to the mentioned direction references. For instance, the direction “left_front2”
refers to the direction located between “left_front” and “left”, which is
shown in Fig. 2. Natural-language-like references are convenient for authors
to specify scripting actions, since they do not require the author to have a
detailed knowledge of reference systems in VRML. Moreover, the proposed
scripting language also supports the original VRML reference system, which
is useful for experienced authors. Directions can also be specified as a
four-place tuple (X, Y, Z, R), for example rotation(1, 0, 0, 1.57).
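
The direction names and their interpolations can be pictured as unit vectors in the H-anim frame. The Python sketch below is an illustration under stated assumptions, not STEP's actual implementation: the helper names combine and between are hypothetical, and placing “left_front2” exactly halfway between “left_front” and “left” is an assumption, since the text only says that it lies between them.

```python
import math

# Unit vectors in the H-anim frame from the text: +X is the humanoid's
# left, +Y is up, +Z is front.
BASE = {
    "left": (1.0, 0.0, 0.0), "right": (-1.0, 0.0, 0.0),
    "up": (0.0, 1.0, 0.0), "down": (0.0, -1.0, 0.0),
    "front": (0.0, 0.0, 1.0), "back": (0.0, 0.0, -1.0),
}


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def combine(*names):
    """Direction such as left_front: normalized sum of the base directions."""
    total = [0.0, 0.0, 0.0]
    for name in names:
        for i, c in enumerate(BASE[name]):
            total[i] += c
    return normalize(total)


def between(a, b):
    """Interpolated direction halfway between unit vectors a and b (assumed)."""
    return normalize(tuple(x + y for x, y in zip(a, b)))


left_front = combine("left", "front")
left_front2 = between(left_front, BASE["left"])
```

Under these assumptions, left_front points 45 degrees away from “left” towards “front”, and left_front2 points 22.5 degrees away from “left”.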

Body Reference 

According to the H-anim standard, an H-anim specification contains a set of 
Joint nodes that are arranged to form a hierarchy. Each Joint node can contain 



STEP: a Scripting Language for Embodied Agents 






other Joint nodes and may also contain a Segment node which describes the 
body part associated with that joint. Each Segment can also have a number 
of Site nodes, which define locations relative to the segment. Sites can be used
for attaching accessories, like hat, clothing, and jewelry. In addition, they can 
be used to define eye points and viewpoint locations. Each Segment node 
can have a number of Displacer nodes that specify which vertices within the 
segment correspond to a particular feature or configuration of vertices. 



Fig. 3. Typical joints for humanoid



Figure 3 shows several joints of humanoids. Turning body parts of hu- 
manoids implies the setting of the corresponding joint’s rotation. Moving the 
body parts means the setting of the corresponding joint’s position. For in- 
stance, the action “turning the left-arm to the front slowly” is specified as: 

turn(Agent, l_shoulder, front, slow)

Based on the H-anim specification, all body joints are contained in a hi- 
erarchical structure. Accordingly, the direction reference of a body joint in 
STEP is measured relative to the default rotations of its ancestor joints in the 
hierarchy. For instance, Fig. 4(a) shows the posture of the left elbow joint to 
the direction “front” relative to the default posture of the avatar. However,
when the left shoulder joint or one of its parent joints points to the direction
“front”, the left elbow joint pointing to “front” results in a posture in which
the left hand points in the direction “up”, as shown in Fig. 4(b). In practice,
this kind of direction reference does not cause difficulties for authoring, for
the correct direction can be obtained by reducing the directions of its
ancestor body parts to the default ones. Therefore, STEP is well suited for a
forward kinematic system. Moreover, we would like to point out that STEP
can also be used to solve inverse kinematic problems. That will be shown in
Sect. 4.
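
The effect shown in Fig. 4 follows from composing joint rotations down the hierarchy. The following Python sketch reproduces it with plain rotation matrices. It is a simplified illustration, not STEP code: it assumes that turning a joint to “front” corresponds to a rotation of -90 degrees about the X-axis applied to an arm hanging along -Y, and the helper names rot_x, matmul, and apply are hypothetical.

```python
import math


def rot_x(angle):
    """3x3 rotation matrix about the X-axis (right-handed, radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


# In the default pose the arm hangs down along -Y. Turning a joint to
# "front" is modelled here as a -90 degree rotation about X (an assumption).
FRONT = rot_x(-math.pi / 2)
ARM_DOWN = (0.0, -1.0, 0.0)

# Elbow "front" while the shoulder keeps its default rotation.
forearm_a = apply(FRONT, ARM_DOWN)

# Elbow "front" while the shoulder also points "front": the joint
# rotations compose down the hierarchy (parent first, then child).
forearm_b = apply(matmul(FRONT, FRONT), ARM_DOWN)
```

In the first case the forearm points along +Z (front); in the second it points along +Y (up), which matches the posture described for Fig. 4(b).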




Fig. 4. Elbow joint in different situations 



Time Reference 

STEP has the same time reference system as VRML. For example, the action 
turning the left arm to the front in 2 seconds can be specified as: 

turn(Agent, l_shoulder, front, time(2, second))

This kind of explicit specification of duration in scripting actions does not 
satisfy the parameterization principle. Therefore, we introduce a more flexible 
time reference system based on the notions of beat and tempo. A beat is a 
time interval for body movements, whereas the tempo is the number of beats 
per minute. By default, the tempo is set to 60, i.e. a beat corresponds to a 
second. However, the tempo can be changed. Moreover, we can define different 
speeds for body movements: for example, the speed “fast” can be defined as 
one beat, whereas the speed “slow” can be defined as three beats. 
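
The relation between beats, tempo, and seconds is simple arithmetic: one beat lasts 60/tempo seconds. The Python sketch below illustrates it; the function duration_seconds, the tuple encoding of time(2, second), and the "beat" unit are assumptions made for the example, while the beat counts for “fast” and “slow” are the ones quoted above.

```python
SPEED_BEATS = {"fast": 1, "slow": 3}   # beat counts quoted in the text


def duration_seconds(duration, tempo=60):
    """Convert a STEP-style duration to seconds.

    `duration` is either a speed name or a ("time", value, unit) tuple
    mirroring time(2, second). `tempo` is the number of beats per minute.
    """
    if isinstance(duration, str):
        return SPEED_BEATS[duration] * 60.0 / tempo
    kind, value, unit = duration
    if kind != "time":
        raise ValueError("unknown duration: %r" % (duration,))
    if unit == "second":
        return float(value)
    if unit == "beat":
        # Hypothetical unit, added only to show tempo-relative durations.
        return value * 60.0 / tempo
    raise ValueError("unknown unit: %r" % (unit,))
```

At the default tempo of 60, a “slow” movement takes 3 seconds; doubling the tempo to 120 halves it to 1.5 seconds, whereas an explicit time(2, second) is unaffected by the tempo.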
3.2 Primitive Actions and Composite Operators

Turn and move are the two main primitive actions for body movements. Turn 
actions specify the change of the rotations of the body parts or the whole 
body over time, whereas move actions specify the change of the positions of 
the body parts or the whole body over time. A turn action of a body part is 
defined as follows: 

turn(Agent, BodyPart, Direction, Duration)

where Direction can be a natural-language-like direction like “front” or a
rotation value like “rotation(1,0,0,3.14)”, and Duration a speed name like “fast”
or an explicit time specification, like “time(2, second)”.

A move action of a body part is defined as: 

move(Agent, BodyPart, Direction, Duration)

where Direction can be a natural-language-like direction like “front”, a
position value like “position(1,0,10)”, or an increment value like
“increment(1,0,0)”. The turn and move actions of the whole body are defined as
follows:

turn_body(Agent, Direction, Duration)
move_body(Agent, Direction, Duration)

Typical composite operators for scripting actions are: 

• Sequence operator “seq”: the action seq([Action_1, ..., Action_n]) denotes
a composite action in which Action_1, ..., Action_n are executed sequentially:

seq([turn(agent, l_shoulder, front, fast),
     turn(agent, r_shoulder, front, fast)])

• Parallel operator “par”: the action par([Action_1, ..., Action_n]) denotes
a composite action in which Action_1, ..., Action_n are executed simultaneously.

• Non-deterministic choice operator “choice”: the action
choice([Action_1, ..., Action_n]) denotes a composite action in which one of
the Action_1, ..., Action_n is executed.

• Repeat operator “repeat”: the action repeat(Action, T) denotes a composite
action in which the Action is repeated T times.
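
To make the timing behaviour of the four operators concrete, the following Python sketch computes when each primitive action would start and end. It is an illustration only, not the STEP implementation: the tuple encoding of actions, the function name schedule and the use of beats as the time unit are all assumptions.

```python
import random


def schedule(action, start=0.0, rng=random):
    """Return (events, end); events is a list of (start, end, name) tuples.

    Assumed encoding: ("prim", name, beats), ("seq", [a1, ...]),
    ("par", [a1, ...]), ("choice", [a1, ...]), ("repeat", a, times).
    """
    kind = action[0]
    if kind == "prim":
        _, name, beats = action
        return [(start, start + beats, name)], start + beats
    if kind == "seq":                       # each action starts when the previous ends
        events, t = [], start
        for sub in action[1]:
            ev, t = schedule(sub, t, rng)
            events += ev
        return events, t
    if kind == "par":                       # all actions start together
        events, end = [], start
        for sub in action[1]:
            ev, t = schedule(sub, start, rng)
            events += ev
            end = max(end, t)
        return events, end
    if kind == "choice":                    # exactly one alternative is executed
        return schedule(rng.choice(action[1]), start, rng)
    if kind == "repeat":                    # same as a sequence of identical copies
        return schedule(("seq", [action[1]] * action[2]), start, rng)
    raise ValueError(f"unknown operator: {kind}")
```

For two one-beat turns, the "seq" composition ends after two beats with the second turn starting at beat one, whereas the "par" composition ends after one beat with both turns starting at zero.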

3.3 STEP and Dynamic Logic 

STEP is based on dynamic logic [9] and allows for arbitrary abstractions using 
the primitives and composition operators provided by the logic. In dynamic 
logic, there is a clear distinction between an action and a state. Semantically, 
a state represents the properties at a particular moment, whereas an action 
consists of a set of state pairs, which represent a relation between two states. 
Thus, there are two sub-languages in dynamic logic: a sub-language for
actions and a sub-language for states. The latter is called the meta language of
dynamic logic. Let $\alpha$ be an action represented in the action sub-language, and
$\psi$ and $\varphi$ the property formulas represented in the meta language. In dynamic
logic, a formula like

$$\psi \rightarrow [\alpha]\varphi$$

means that if the property $\psi$ holds, then the property $\varphi$ holds after doing the
action $\alpha$. The formula above states a relation between the pre-condition $\psi$ and
the post-condition $\varphi$ for the action $\alpha$.

A scripting language based on the semantics of dynamic logic is well suited 
for the purpose of intelligent embodied agents. As discussed previously, the 
scripting language is primarily designed for the specification of body lan- 
guage and speech for embodied agents. In this framework, the specification 
of external-oriented communicative acts can be separated from the internal 
states of embodied agents because the former involves only geometric changes 
of the body objects and the natural transition of the actions, whereas the 
latter involves more complicated computation and reasoning. 

Dynamic logic has several primitive action operators: “$\alpha;\beta$” means that
$\alpha$ is executed before $\beta$; “$\alpha \cup \beta$” means that either $\alpha$ or $\beta$ is executed
non-deterministically; “$\alpha^*$” means that $\alpha$ is executed a finite, but
non-deterministic number of times; and “$p?$” means to proceed if $p$ is true, else fail.
Based on these primitive action operators, some typical actions are relatively
easy to define [9], for example:

if $p$ then $\alpha$ else $\beta$ as $(p?;\alpha) \cup (\neg p?;\beta)$

while $p$ do $\alpha$ as $(p?;\alpha)^*;\neg p?$

repeat $\alpha$ until $p$ as $\alpha;(\neg p?;\alpha)^*;p?$

IF $p \rightarrow \alpha \,\|\, q \rightarrow \beta$ FI as $(p?;\alpha) \cup (q?;\beta)$

Therefore, based on the formal semantics of dynamic logic, STEP has a solid
semantic foundation, in spite of the large number of variants of its compositional
operators. Refer to [13] for more details on the semantics of STEP.
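
The relational reading of actions (an action as a set of state pairs) can be tried out directly on a small finite state space. The Python sketch below is an illustration of that standard semantics, not code from STEP or DLP; all function names are chosen here, and states are assumed to be hashable values taken from a finite collection.

```python
def compose(a, b):
    """a; b : do a, then b."""
    return {(s, u) for (s, t) in a for (t2, u) in b if t == t2}


def union(a, b):
    """a U b : execute either a or b, non-deterministically."""
    return a | b


def guard(p, states):
    """p? : stay in the same state if p holds there, otherwise fail."""
    return {(s, s) for s in states if p(s)}


def star(a, states):
    """a* : zero or more iterations of a (reflexive-transitive closure)."""
    result = {(s, s) for s in states}
    while True:
        bigger = result | compose(result, a)
        if bigger == result:
            return result
        result = bigger


def while_do(p, a, states):
    """while p do a, defined as (p?; a)*; (not p)?"""
    body = compose(guard(p, states), a)
    return compose(star(body, states), guard(lambda s: not p(s), states))


def box(a, post, states):
    """States where [a]post holds: every a-successor satisfies post."""
    return {s for s in states if all(post(t) for (s2, t) in a if s2 == s)}
```

As a usage example, take the states 0 to 5 and a decrement action that maps s to s - 1 for s > 0. The derived loop while_do(lambda s: s > 0, dec, states) relates every state to 0, and box applied to that loop with the post-condition s == 0 returns all six states.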



3.4 High-Level Interaction Operators 

When using high-level interaction operators, scripting actions can directly 
interact with internal states of embodied agents or with external states of 
worlds. These interaction operators are based on a meta language which is 
used to build embodied agents, say, in the distributed logic programming 
language DLP. In the following, we use the lower case Greek letters $\varphi$, $\psi$, $\chi$ to
denote formulas in the meta language. Similar to those in dynamic logic, STEP
has the following higher-level interaction operators:

• test: test($\varphi$), check the state $\varphi$. If $\varphi$ holds then skip, otherwise fail.

• execution: do($\varphi$), make the state $\varphi$ true, i.e. execute $\varphi$ in the meta language.





• conditional: if_then_else($\varphi$, action_1, action_2).

• until: until(action, $\varphi$), perform action until $\varphi$ holds.

The above-mentioned action operators are sufficiently powerful to define a
number of variants of scripting actions. In particular, the execution operator
“do” is used to access certain computation and interaction capabilities from
the meta language level. In DLP and Prolog, the predicate “is” is for the
evaluation of arithmetic expressions. Accordingly, actions which involve the “do”
operator and the predicate “is”, like do(N is sqrt(S)), can be used to perform
computations in STEP. Actions with the “do” operator in combination with
the VRML/X3D EAI predicates in DLP, like do(getPosition(Agent, X, Y, Z))
and do(setRotation(Object, X, Y, Z, R)), can be used to interact with virtual
worlds. The same patterns of actions in combination with the available
communication predicates at the meta language level can be used to achieve
certain communication facilities between embodied agents. We will discuss some
details of how these capabilities can be achieved in Sect. 4. Before doing so,
we will describe a brief example of how a number of temporal relations can
be defined in terms of the parallel action operator “par” and the sequential
action operator “seq” by means of the execution operator “do”. As discussed
in [1, 2], there are 13 possible temporal relations between two actions, that is,
before, meets, overlaps, starts, during, finishes, equals, and their inverse
relations. All these 13 possible temporal relations can be defined in STEP
[13], for example:

before(A1,A2)   = seq([A1, do(random(N)), wait(N), A2])
meets(A1,A2)    = seq([A1, A2])
overlaps(A1,A2) = par([A1, seq([duration(A1,T1), do(random(R)),
                                do(N is T1*R), wait(N), A2])])
starts(A1,A2)   = par([A1, A2])

where duration(A, T) calculates the duration T of the action A, which can
be defined recursively on the sub-actions of A. wait(N) is a special action
which does nothing but wait for N seconds. The action wait(N) can be
defined as seq([do(T is N * 1000), do(sleep(T))]) (see footnote 4). See [13] for
more details with respect to the expressiveness of STEP and its semantics.
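
The effect of these definitions on timing can be summarized by asking when the second action starts relative to the first. The Python sketch below does only that arithmetic; it is not STEP code. The function name is hypothetical, and it assumes that random yields a value in the interval [0, 1), which the text does not specify.

```python
import random


def start_of_second(relation, t1, rng=random):
    """Start time of A2, measured from the start of A1, whose duration is t1.

    Assumption: random(N) / random(R) in the scripts yields a value in [0, 1).
    """
    if relation == "before":      # seq([A1, wait(N), A2]) with a random N
        return t1 + rng.random()
    if relation == "meets":       # A2 starts exactly when A1 ends
        return t1
    if relation == "overlaps":    # N is T1 * R, so A2 starts while A1 is running
        return t1 * rng.random()
    if relation == "starts":      # both actions start together
        return 0.0
    raise ValueError(f"unknown relation: {relation}")
```

Note that under this assumption a random value of exactly zero makes "before" coincide with "meets" and "overlaps" coincide with "starts", so a stricter definition would draw from an open interval.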

We have implemented the scripting language STEP in the distributed logic 
programming language DLP. See [14] for implementation issues of STEP. 
Based on STEP, we have also implemented XSTEP [15], the XML-based 
markup language for embodied agents. 



4 Examples 

In this section, we discuss several examples of how STEP can be used to
define scripting actions for embodied agents. The first two examples, “walk” and
“run”, describe general examples of body movements of embodied agents. The
third example, consisting of “look at ball” and “run to ball”, describes actions
which demonstrate the interaction between agents and virtual worlds. Finally,
in the fourth example, “touch”, we discuss how STEP can be used to solve some
inverse kinematics problems for embodied agents. The first two examples
demonstrate how users can use STEP easily. The third and the fourth examples
require some knowledge of 3D geometry. They are designed for professional users.

4 Because the predicate sleep in DLP requires milliseconds.

4.1 Walk and its Variants 





A walking posture can be expressed as a movement which consists of the 
following two main activities: an action in which the left arm/right leg move 
forward while the right arm/left leg move backward, and an action in which the 
right arm/left leg move forward while the left arm/right leg move backward. 
The main poses and their linear interpolations are shown in Fig. 5. The walk 
action can be described in the scripting language as follows: 

script(walk_pose(Agent), Action) :-
  Action = seq([par([
      turn(Agent, r_shoulder, back_down2, fast),
      turn(Agent, r_hip, front_down2, fast),
      turn(Agent, l_shoulder, front_down2, fast),
      turn(Agent, l_hip, back_down2, fast)]),
    par([
      turn(Agent, l_shoulder, back_down2, fast),
      turn(Agent, l_hip, front_down2, fast),
      turn(Agent, r_shoulder, front_down2, fast),
      turn(Agent, r_hip, back_down2, fast)])]).

As shown below, a walk step can be described as a parallel action which 
consists of the walking posture and the moving action (i.e. changing position): 

script(walk_forward_step(Agent), Action) :-
  Action = par([walk_pose(Agent),
                move(Agent, front, fast)]).

The step length can be a concrete value. For example, for a 0.7 meter step
size, it can be defined as:

script(walk_forward_step07(Agent), Action) :-
  Action = par([walk_pose(Agent),
                move(Agent, increment(0.0, 0.0, 0.7), fast)]).

Alternatively, the step length can also be a variable: 

script(walk_forward_step0(Agent, StepLength), Action) :-
  Action = par([walk_pose(Agent),
                move(Agent, increment(0.0, 0.0, StepLength), fast)]).

Therefore, walking forward N steps with a particular StepLength can be 
defined as follows: 

script(walk_forward(Agent, StepLength, N), Action) :-
  Action = repeat(walk_forward_step0(Agent, StepLength), N).

As mentioned above, animations of the walk action based on these defini- 
tions are simplified and approximated ones. As analyzed in [7, 20], a realistic 
animation of walk motions of a human figure involves many computations 
which rely on a robust simulator where forward and inverse kinematics are 
combined with automatic collision detection and response. It is not our in- 
tention to use the scripting language to achieve a fully realistic animation of 
the walk action, because it is seldom necessary for most web applications. 
However, we would like to point out that there does exist the possibility to 
accommodate some inverse kinematics to improve the realism by using the 
scripting language. 




Fig. 6. Poses of run 

4.2 Run and its Deformation 

As a first approximation, the action “run” is similar to the action “walk”,
but with bent arms and legs. Bending the legs makes them look as if they are
lifted from the ground, which is an important difference between the action
“walk” and the action “run” [19]. The run pose is shown in Fig. 6(a). As
we can see from the figure, the left lower arm points to the direction “front_up”
when the left upper arm points to the direction “front_down2” during the run
action. Considering the hierarchies of the body parts, we should not use the
primitive action turn(Agent, l_elbow, front_up, fast) but the primitive action
turn(Agent, l_elbow, front, fast), because the direction of the left lower arm
should be defined relative to the direction of its parent body part, i.e. the
left arm (more exactly, the joint l_shoulder). This kind of redirection does not
impose major difficulties for authoring, because the correct direction can be
obtained by reducing the directions of its parent body parts to be the default
ones. As we can see in Fig. 6(b), the lower arm actually points to the direction
“front”.

Based on the action “walk”, the action “run_pose” can be defined as an
action which starts with a run pose as shown in Fig. 6(b) and then repeats the
action “walk_pose” N times:

script(basic_run_pose(Agent), Action) :-
  Action = par([turn(Agent, r_elbow, front, fast),
                turn(Agent, l_elbow, front, fast),
                turn(Agent, l_hip, front_down2, fast),
                turn(Agent, r_hip, front_down2, fast),
                turn(Agent, l_knee, back_down, fast),
                turn(Agent, r_knee, back_down, fast)]).

script(run_pose(Agent, N), Action) :-
  Action = seq([basic_run_pose(Agent),
                repeat(walk_pose(Agent), N)]).

Therefore, the action running forward N steps with a particular StepLength 
can be defined in the scripting language as follows: 

script(run(Agent, StepLength, N), Action) :-
  Action = seq([basic_run_pose(Agent),
                walk_forward(Agent, StepLength, N)]).

In practice, the action “run” (Fig. 7) may have many variants. For instance,
the lower arm may point to different directions; it does not necessarily point to
the direction “front”. Therefore, we may define the action “run” with respect
to certain degrees of freedom. The following example defines a degree of freedom
for the angle of the lower arms to achieve the deformation:

script(basic_run_pose_elbow(Agent, Elbow_Angle), Action) :-
  Action = par([
      turn(Agent, r_elbow, rotation(1,0,0,Elbow_Angle), fast),
      turn(Agent, l_elbow, rotation(1,0,0,Elbow_Angle), fast),
      turn(Agent, l_hip, front_down2, fast),
      turn(Agent, r_hip, front_down2, fast),
      turn(Agent, l_knee, back_down, fast),
      turn(Agent, r_knee, back_down, fast)]).

script(run_e(Agent, StepLength, N, Elbow_Angle), Action) :-
  Action = seq([basic_run_pose_elbow(Agent, Elbow_Angle),
                walk_forward(Agent, StepLength, N)]).

Fig. 7. Run 



4.3 Interaction with Virtual Worlds 



In this section we want to show how the interaction between embodied agents 
and virtual worlds can be achieved by using the high-level interaction oper- 
ators. Consider a situation in which there are several agents and a ball. The 
position of the ball is always changing because other agents may kick the ball. 
We want to design the script actions for embodied agents so that they can 
always look at the ball and run to the ball no matter where the ball is located. 

In the following, we suppose that the meta language of the scripts is 
DLP. Other languages can be used following the same strategy. Using DLP’s 
VRML/X3D predicates, we can manipulate 3D objects in virtual worlds. For 
example, given the current position of the embodied agent and the ball, we 
can always calculate the new rotation of the agent so that it will look at the 
ball. By using the high-level interaction operator do with the built-in opera- 
tors in the meta language we can define the script action “look at ball” and 
other relevant actions. 

First we want to define a scripting action “turn_to_direction” which
transforms a source direction vector into a destination direction by means of
particular vector processing predicates. We know that the result of a vector cross
product of two vectors $v_1$ and $v_2$ is a normal vector, i.e. a vector that is
perpendicular to the original vectors $v_1$ and $v_2$. Such a normal vector defines the
axis of the rotation, and the corresponding angle $\theta$ between these two vectors
can be calculated by the following formula:

$$\cos\theta = \frac{v_1 \cdot v_2}{|v_1| \times |v_2|}$$

Therefore, a scripting action “turn_to_direction” can be defined as:

Fig. 8. Look at ball 



script(turn_to_direction(Object, SrcVector, DestVector), Action) :-
  Action = seq([
    do(vector_cross_product(SrcVector, DestVector, vector(X,Y,Z), R)),
    do(setRotation(Object, X, Y, Z, R))]).

where the predicate vector_cross_product(S, D, V, R) calculates the cross
product V of the vector S and the vector D, as well as the angle R between the
two vectors.
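
What such a predicate has to compute can be sketched in a few lines of Python. This is an illustration of the cross product and of the angle formula, not the DLP predicate itself; the return convention (a pair of axis and angle) is an assumption, and math.hypot with three arguments requires Python 3.8 or later.

```python
import math


def vector_cross_product(s, d):
    """Return (axis, angle): axis = s x d, angle between s and d in radians."""
    sx, sy, sz = s
    dx, dy, dz = d
    axis = (sy * dz - sz * dy, sz * dx - sx * dz, sx * dy - sy * dx)
    norm = math.hypot(*s) * math.hypot(*d)
    if norm == 0.0:
        raise ValueError("a zero-length vector has no direction")
    cos_theta = (sx * dx + sy * dy + sz * dz) / norm
    cos_theta = max(-1.0, min(1.0, cos_theta))   # guard against rounding errors
    return axis, math.acos(cos_theta)
```

For the source vector (0, 0, 1) and the destination vector (1, 0, 0), the axis is (0, 1, 0) and the angle is a quarter turn, i.e. a rotation about the Y axis. For parallel or opposite vectors the cross product is the zero vector, so a caller would have to choose an axis in that case.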

In general, embodied agents turn to the ball within the XZ plane, so we
can ignore the Y-parameters. The Y-parameters are useful only when
we want to calculate a rotation for the agent’s head so that it can look
down to the ball. H-anim avatars always face the +Z direction by default.
Thus, the source vector is (0, 0, 1). The destination vector can be calculated
from the positions of the agent and the ball. Therefore, the scripting action
“look_at_position” can be defined as follows:

script(look_at_position(Agent, X1, _Y1, Z1), Action) :-
  Action = seq([do(getPosition(Agent, X, _Y, Z)),
                do(Xdif is X1-X),
                do(Zdif is Z1-Z),
                turn_to_direction(Agent, vector(0.0, 0.0, 1.0),
                                  vector(Xdif, 0.0, Zdif))]).

Based on the scripting action “look_at_position” , the scripting action “look_at_ball” 
(Fig. 8) can be easily defined as follows: 

script(look_at_ball(Agent, Ball), Action) :-
  Action = seq([do(getPosition(Ball, X1, Y1, Z1)),
                look_at_position(Agent, X1, Y1, Z1)]).

In the following, we want to define a script action run_to_ball(Agent, Ball, N)
so that the agent can continually run to the ball in N steps. Similarly, we
use the do-operator to obtain the current positions of the agent and the ball
first, from which we can calculate the increments of the positions in the X and Z
dimensions:

script(run_to_ball(Agent, Ball, Steps), Action) :-
  Action = seq([do(getPosition(Agent, X, _, Z)),
                do(getPosition(Ball, X1, _, Z1)),
                do(StepLengthX is (X1-X)/Steps),
                do(StepLengthZ is (Z1-Z)/Steps),
                run_steps(Agent, increment(StepLengthX, 0.0,
                                           StepLengthZ), Steps)]).

The scripting action run_steps(Agent, Increment, N) describes an action
in which the agent changes its position in N steps. This action can be defined
as a recursive action:

script(run_steps(Agent, increment(X,Y,Z), 1), Action) :-
  Action = par([run_pose(Agent),
                move(Agent, increment(X,Y,Z), fast)]).

script(run_steps(Agent, increment(X,Y,Z), Steps), Action) :-
  Action = seq([par([run_pose(Agent),
                     move(Agent, increment(X,Y,Z), fast)]),
                do(Steps1 is Steps - 1),
                run_steps(Agent, increment(X,Y,Z), Steps1)]).
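
The arithmetic behind these scripts can be checked with a short Python sketch that lists the positions an agent would pass through. It is an illustration only; the function name and the use of (x, z) pairs are assumptions, and the pose animation is left out.

```python
def run_to_ball_path(agent_xz, ball_xz, steps):
    """Positions visited when the agent-ball offset is covered in equal steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    (x, z), (x1, z1) = agent_xz, ball_xz
    step_x = (x1 - x) / steps          # corresponds to StepLengthX
    step_z = (z1 - z) / steps          # corresponds to StepLengthZ
    path = []
    for i in range(1, steps + 1):      # one entry per run step
        path.append((x + i * step_x, z + i * step_z))
    return path
```

For an agent at (0, 0) and a ball at (4, 2), four steps give increments of 1.0 and 0.5 and end exactly at the ball. As in the script, the ball position is read once at the start, so if the ball moves during the run the agent arrives at the old position.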

4.4 Touch: an Inverse Kinematics Problem 

A typical inverse kinematics problem is the calculation of the rotations of 
arms and wrists of embodied agents so that their hands can touch an object. 
As discussed in [20], many research efforts deal with this kind of problem. 
Finding solutions to this kind of inverse kinematics problem usually involves 
complex computations, like solving differential equations or applying particu- 
lar non-linear optimizations [4, 20]. As discussed above, we can use high-level 
interaction operators to access the computational capabilities of the meta 
language in order to find the solutions by using the same methods which 
have been proposed in the literature. However, adopting these analytical and 
numerical methods to solve inverse kinematics problems may cause some per- 
formance problems for web applications. Therefore, one of our concerns is to 
find an acceptable trade-off between performance and realistic animations. 

To illustrate this, we will discuss a “touch” example in more detail to 
show how the scripting language STEP can be used to solve some real-time 
inverse kinematics problems with a satisfying performance result. To simplify 
the problem, embodied agents are designed to behave like this: they will touch 
an object by using their hands if the object is reachable, otherwise they will 
point their hands in the direction of the object. In addition, we will ignore 
the upper and lower limits of the rotations of the shoulder and elbow joints. 
In particular, we assume that the elbow joint has enough degrees of freedom 
for an appropriate solution. 

This simplified “touch” problem can be described as: given an agent Agent
and a position $(x_0, y_0, z_0)$ of an object, try to set the rotations of the joints of
the shoulder and the elbow so that the hand of the agent can touch exactly
the position if the position is reachable. Suppose that the length of the upper
arm is $u$, the length of the forearm is $f$, and the distance between the shoulder
center $(x_3, y_3, z_3)$ and the destination position $(x_0, y_0, z_0)$ is $d$. The position
$(x_0, y_0, z_0)$ is reachable if and only if $d \le u + f$ if we ignore the upper and
lower limits of the joint rotations. From the cosine law we know that if the
object is reachable, then $\alpha$, the angle between the upper arm and the forearm,
can be calculated from:

$$\alpha = \arccos\frac{u^2 + f^2 - d^2}{2uf}$$

Furthermore, if $v$ is the direction vector which points to the destination
position from the shoulder center, $v_0$ the default direction vector of the arm, and
$v_1$ the destination direction vector of the upper arm (Fig. 9), then the angle
$\beta$ between the vector $v$ and $v_1$ is given by:

$$\beta = \arccos\frac{u^2 + d^2 - f^2}{2ud}$$

if the object is within the agent’s reach.



Fig. 9. Inverse kinematics of touch 

If the position is not reachable, then $\alpha = \pi$ and $\beta = 0$ so that the arm will
point to the direction of the destination position. Moreover, if $d \approx 0$, then the
destination position is close to the shoulder center. In this case, we set $\alpha = 0$
and $\beta = 0$. We can define a scripting action to realize the functions for $\alpha$ and
$\beta$ as follows (see footnote 5):

script(getABvalue(Agent, position(X0,Y0,Z0), Hand, A, B), Action) :-
  Action = seq([
    getDvalue(Agent, position(X0,Y0,Z0), Hand, D),
    get_upperarm_length(Agent, L1),
    get_forearm_length(Agent, L2),
    do(D1 is L1 + L2),
    if_then_else(sign(D1-D) > sign(0.001-D),
      seq([do(cosine_law(L1, L2, D, A)),
           do(cosine_law(L1, D, L2, B))]),
      seq([do(A is 1.57*(1+sign(D-0.001))),
           do(B is 0.0)]))]).
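
The case analysis hidden in the sign comparison may be easier to follow when written out explicitly. The following Python sketch mirrors the three cases of the script under the assumption that cosine_law returns the angle opposite its third argument; the function names and the clamping of the cosine value are choices made here, not part of STEP.

```python
import math


def cosine_law(a, b, c):
    """Angle in radians opposite side c in a triangle with sides a, b, c."""
    cos_angle = (a * a + b * b - c * c) / (2.0 * a * b)
    return math.acos(max(-1.0, min(1.0, cos_angle)))   # clamp rounding errors


def get_ab_value(upper, fore, d, eps=0.001):
    """Return (alpha, beta) for upper-arm length, forearm length and distance d."""
    if d < eps:                  # target practically at the shoulder center
        return 0.0, 0.0
    if d > upper + fore:         # out of reach: stretch the arm and point at it
        return math.pi, 0.0
    return cosine_law(upper, fore, d), cosine_law(upper, d, fore)
```

For an upper arm and a forearm of length 1 and a target at distance √2, the elbow angle is a right angle and the angle at the shoulder is 45 degrees. The script approximates π by 2 × 1.57, whereas the sketch uses math.pi.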

The predicate getDvalue is an action which calculates the distance D
between the shoulder center and the destination touch position for an agent.
Suppose that the destination position $(x_0, y_0, z_0)$ is relative to the coordinate
system of the agent body in which the agent is positioned in the default
position and orientation of H-anim avatars; that is, it faces in the +Z direction
at the position (0, 0, 0). The action getDvalue can be defined by obtaining the
positions of the shoulder. In the following, we will define a “touch” action for
relative positions first. We call the “touch” action for agents with arbitrary
position and arbitrary orientation a “touch” action for an absolute position.
We will show how the “touch” action for absolute positions can be based on
a “touch” action for relative positions.

The cross product $v_0 \times v$, i.e. a normal vector $n = (x_n, y_n, z_n)$, can be
considered as a normal vector for $v_0$ and $v_1$, which defines the plane in which
the arm turns from its default rotation to a destination rotation. This means
that we require that the vector $v_1$ is in the same plane as the vectors $v$ and $v_0$
so that the arm will turn close to the destination position via the shortest path.
The angle $\gamma$ between $v_0$ and $v$ can be calculated with the vector predicates,
like those that were used in the last example. Thus, the rotation for the
elbow joint is $(x_n, y_n, z_n, \pi - \alpha)$, and the rotation for the shoulder joint is
$(x_n, y_n, z_n, \gamma - \beta)$.

The vector $v$ can be calculated by using the following script, considering
that the destination position is a relative one:

script(getVvalue(Agent, position(X0,Y0,Z0), Hand, V), Action) :-
  Action = seq([
    get_shoulder_center(Agent, Hand, position(X2,Y2,Z2)),
    do(direction_vector(position(X2,Y2,Z2), position(X0,Y0,Z0), V))]).

where the predicate get_shoulder_center gets the position of the shoulder
center, and the predicate direction_vector obtains a direction vector of the two
positions. It is easy to define these two predicates at the STEP level. However,
the predicate direction_vector is already available in DLP in order to obtain
a better performance.

5 The predicate getABvalue(Agent, position(X0,Y0,Z0), Hand, A, B) means that
for the agent Agent and the destination position (X0,Y0,Z0) of the Hand, the
value of $\alpha$ is A, and the value of $\beta$ is B.

Now, we define the scripting action “touch” for relative positions with the 
left hand as follows: 

script(touch(Agent, position(X0,Y0,Z0), l), Action) :-
  Action = seq([
    getABvalue(Agent, position(X0,Y0,Z0), l, A, B),
    do(R1 is 3.14-A),
    getVvalue(Agent, position(X0,Y0,Z0), l, V),
    get_arm_vector(Agent, l, V0),
    do(vector_cross_product(V0, V, vector(X3,Y3,Z3), C)),
    do(R2 is C-B),
    par([turn(Agent, l_shoulder, rotation(X3,Y3,Z3,R2), fast),
         turn(Agent, l_elbow, rotation(X3,Y3,Z3,R1), fast),
         turn(Agent, l_wrist, rotation(X3,Y3,Z3,-0.5), fast)])]).

Although we do not calculate the rotation for the wrist joint, we can 
adjust the rotation of the wrist joint based on the same normal vector, so 
that the hand can rotate a little bit to the position to achieve more realism. 
The “touch” action with the right hand can be defined similarly. 

Finally, we can define the “touch” action for absolute positions in terms 
of the “touch” action for relative positions, by the translation of the absolute 
position into a relative position, based on the agent’s current position and 
orientation. 

script(touch_absolutePosition(Agent, position(X1,Y1,Z1), Hand), Action) :-
  Action = seq([do(getPosition(Agent, X, Y, Z)),
                do(getRotation(Agent, X2, Y2, Z2, R)),
                do(X3 is X1-X),
                do(Y3 is Y1-Y),
                do(Z3 is Z1-Z),
                do(R1 is -R),
                do(position_rotation(position(X3,Y3,Z3),
                                     rotation(X2,Y2,Z2,R1), position(X4,Y4,Z4))),
                touch(Agent, position(X4,Y4,Z4), Hand)]).

where the predicate position_rotation(P1, R, P2) gets the new position P2
for a given position P1 after the rotation R.
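
The translation from an absolute to a relative position amounts to subtracting the agent's position and undoing the agent's rotation. The Python sketch below shows one way to do this with Rodrigues' rotation formula; it is a swapped-in illustration, since the text does not say how the DLP predicate is implemented, and the function to_relative is a hypothetical helper.

```python
import math


def position_rotation(p, rotation):
    """Rotate the point p by an axis-angle rotation (x, y, z, angle)."""
    ax, ay, az, angle = rotation
    n = math.sqrt(ax * ax + ay * ay + az * az)
    if n == 0.0:
        return tuple(p)                  # no axis: treat as no rotation
    k = (ax / n, ay / n, az / n)         # unit rotation axis
    px, py, pz = p
    c, s = math.cos(angle), math.sin(angle)
    dot = k[0] * px + k[1] * py + k[2] * pz
    cross = (k[1] * pz - k[2] * py, k[2] * px - k[0] * pz, k[0] * py - k[1] * px)
    # Rodrigues: p*cos + (k x p)*sin + k*(k . p)*(1 - cos)
    return tuple(v * c + cr * s + kk * dot * (1.0 - c)
                 for v, cr, kk in zip(p, cross, k))


def to_relative(target, agent_pos, agent_rot):
    """Express an absolute target position in the agent's default frame."""
    offset = tuple(t - a for t, a in zip(target, agent_pos))
    x, y, z, r = agent_rot
    return position_rotation(offset, (x, y, z, -r))   # undo the agent's rotation
```

As a usage example, an agent at (1, 0, 1) that has been turned a quarter turn about the Y axis faces the +X direction. A target at (3, 0, 1) then lies two units straight ahead, and to_relative returns approximately (0, 0, 2), i.e. a point on the +Z axis of the default frame.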

Several touch situations based on this scripting action are shown in Fig. 10. 
The tests show that STEP does not cause serious performance problems for 
this kind of inverse kinematics problem. Currently the computation time for 
each touch action is less than 50 milliseconds on a PC with a 500 MHz CPU 
and 128 MB memory, a low-end computer nowadays, under Windows NT 




